Introduction
This project consists of the analysis of a database of used cars and their corresponding prices, along with other car features. More precisely, we will classify cars according to the amount of pollution they generate, and predict the prices of these used cars by building various Machine Learning models.
With this analysis, we have the opportunity to see how cars relate to pollution, depending on features such as their fuel index and fuel type. In addition, we get an idea of how much a used car costs depending on its model, category, manufacturer, mileage, engine volume, and so on.
This study is based on the data from a hiring competition.
Data Pre-processing
rm(list=ls())
setwd("~/Desktop/Statistical Learning/HW 2")
# LIBRARIES
library(ggplot2)
library(plotly)
library(dplyr)
library(tidyverse)
library(MASS)
library(caret)
library(e1071)
library(GGally)
library(rpart)
library(rpart.plot)
library(randomForest)
library(leaps)
library(olsrr)
library(glmnet)
library(pROC)
In this section, we will first explore the data. It contains 19237 observations and 18 variables. Each row corresponds to a used car, although we will see that some of them may not have been used before (as they have 0 mileage).
The columns of the data correspond to the following:
ID: identification number for each car.
Price: price of the car.
Levy: the motor vehicle levy helps cover the cost of accidents on public roads involving moving vehicles. This variable indicates the amount of levy paid for each car.
Manufacturer: business entity that has produced the automobile.
Model: car design of the manufacturer.
Prod.year: year of production.
Category: type of car.
Leather.interior: whether a car has leather in the interior.
Fuel.type: car fuel type (e.g., petrol or diesel).
Engine.volume: total volume of the cylinders in the engine.
Mileage: km that the car has traveled.
Cylinders: number of cylinders of the car (a vital part of the engine, where fuel is combusted and power is generated).
Gear.box.type: type of transmission (e.g., automatic or manual).
Drive.wheels: type of drive of the wheels.
Doors: number of doors.
Wheel: side of the steering wheel (left-hand or right-hand drive).
Color: color of the car.
Airbags: number of airbags available in the car.
The data pre-processing takes into account the two main goals of the project: car price prediction, and classification of a new variable, to be created, describing the amount of pollution a car generates. First, let us get an insight into the data structure and its composition.
# read the csv file
data = read.csv("Cars.csv")
head(data)
## ID Price Levy Manufacturer Model Prod..year Category
## 1 45654403 13328 1399 LEXUS RX 450 NA Jeep
## 2 44731507 16621 1018 CHEVROLET Equinox 2011 Jeep
## 3 45774419 8467 - HONDA FIT 2006 Hatchback
## 4 45769185 3607 862 FORD Escape 2011 Jeep
## 5 45809263 11726 446 HONDA FIT 2014 Hatchback
## 6 45802912 39493 891 HYUNDAI Santa FE 2016 Jeep
## Leather.interior Fuel.type Engine.volume Mileage Cylinders Gear.box.type
## 1 Yes Hybrid 3.5 186005 km 6 Automatic
## 2 No Petrol 3 192000 km 6 Tiptronic
## 3 No Petrol 1.3 200000 km 4 Variator
## 4 Yes Hybrid 2.5 168966 km 4 Automatic
## 5 Yes Petrol 1.3 91901 km 4 Automatic
## 6 Yes Diesel 2 160931 km 4 Automatic
## Drive.wheels Doors Wheel Color Airbags
## 1 4x4 04-May Left wheel Silver 12
## 2 4x4 04-May Left wheel Black 8
## 3 Front 04-May Right-hand drive Black 2
## 4 4x4 04-May Left wheel White 0
## 5 Front 04-May Left wheel Silver 4
## 6 Front 04-May Left wheel White 4
# structure
str(data)
## 'data.frame': 19237 obs. of 18 variables:
## $ ID : int 45654403 44731507 45774419 45769185 45809263 45802912 45656768 45816158 45641395 45756839 ...
## $ Price : int 13328 16621 8467 3607 11726 39493 1803 549 1098 26657 ...
## $ Levy : chr "1399" "1018" "-" "862" ...
## $ Manufacturer : chr "LEXUS" "CHEVROLET" "HONDA" "FORD" ...
## $ Model : chr "RX 450" "Equinox" "FIT" "Escape" ...
## $ Prod..year : int NA 2011 2006 2011 2014 2016 2010 2013 2014 2007 ...
## $ Category : chr "Jeep" "Jeep" "Hatchback" "Jeep" ...
## $ Leather.interior: chr "Yes" "No" "No" "Yes" ...
## $ Fuel.type : chr "Hybrid" "Petrol" "Petrol" "Hybrid" ...
## $ Engine.volume : chr "3.5" "3" "1.3" "2.5" ...
## $ Mileage : chr "186005 km" "192000 km" "200000 km" "168966 km" ...
## $ Cylinders : int 6 6 4 4 4 4 4 4 4 6 ...
## $ Gear.box.type : chr "Automatic" "Tiptronic" "Variator" "Automatic" ...
## $ Drive.wheels : chr "4x4" "4x4" "Front" "4x4" ...
## $ Doors : chr "04-May" "04-May" "04-May" "04-May" ...
## $ Wheel : chr "Left wheel" "Left wheel" "Right-hand drive" "Left wheel" ...
## $ Color : chr "Silver" "Black" "Black" "White" ...
## $ Airbags : int 12 8 2 0 4 4 12 12 12 12 ...
summary(data)
## ID Price Levy Manufacturer
## Min. :20746880 Min. : 1 Length:19237 Length:19237
## 1st Qu.:45698355 1st Qu.: 5331 Class :character Class :character
## Median :45772306 Median : 13172 Mode :character Mode :character
## Mean :45576555 Mean : 18556
## 3rd Qu.:45802035 3rd Qu.: 22075
## Max. :45816654 Max. :26307500
## NA's :20
## Model Prod..year Category Leather.interior
## Length:19237 Min. :1939 Length:19237 Length:19237
## Class :character 1st Qu.:2009 Class :character Class :character
## Mode :character Median :2012 Mode :character Mode :character
## Mean :2011
## 3rd Qu.:2015
## Max. :2020
## NA's :10
## Fuel.type Engine.volume Mileage Cylinders
## Length:19237 Length:19237 Length:19237 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 4.000
## Mean : 4.583
## 3rd Qu.: 4.000
## Max. :16.000
## NA's :20
## Gear.box.type Drive.wheels Doors Wheel
## Length:19237 Length:19237 Length:19237 Length:19237
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Color Airbags
## Length:19237 Min. : 0.000
## Class :character 1st Qu.: 4.000
## Mode :character Median : 6.000
## Mean : 6.581
## 3rd Qu.:12.000
## Max. :16.000
## NA's :24
# 19237 observations and 18 variables
# 5 numeric columns (integer) and 13 categorical columns (character)
As stated, there are 19237 observations and 18 variables: 5 numeric (integer) columns and 13 categorical (character) columns.
Before the analysis, we must prepare the input.
Adjustment and creation of variables
The name of the production year column (Prod..year) is not uniform with the other column names, so I will rename it.
colnames(data)[6] = "Prod.year"
The Mileage column gives how many km the car has driven, with "km" written after each reading. We remove that suffix and convert the column to numeric for easier use.
data$Mileage = gsub('km', '', data$Mileage)
data$Mileage = as.numeric(data$Mileage)
A car with 0 km traveled is effectively a brand new car, as it hasn't been used before. Therefore, we create a new variable (New) indicating whether a car is new or not.
# if a car has 0 km will mean that is new
# let us create a variable for this kind of cars
data$New = (data$Mileage == 0)
# TRUE = Yes, new
# FALSE = No, not new
data$New = ifelse(data$New == TRUE, "Yes", "No")
# NA doesn't mean 0 km, so NAs in Mileage are treated as not brand new in this variable
for (i in 1:nrow(data)){
if(is.na(data[i,19])){
data[i,19] = 0
}else if(data[i,19] == "No"){
data[i,19] = 0
}else{
data[i,19] = 1
}
}
# 1 -> brand new
# 0 -> used
data$New = as.numeric(data$New)
sum(data$New == 1) / 19237 # proportion of brand new cars
## [1] 0.03737589
We can see that there are not many brand new cars.
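As a side note, the same flag can be built without a row loop. This is a minimal vectorized sketch on a toy vector (standing in for `data$Mileage`), where an NA mileage is treated as "not brand new", matching the loop above:

```r
# Toy mileage vector; NA plays the role of a missing Mileage reading
mileage = c(0, 186005, NA, 0)
# 1 -> brand new (0 km), 0 -> used or unknown; FALSE & NA is FALSE in R,
# so an NA mileage safely maps to 0
new_flag = ifelse(!is.na(mileage) & mileage == 0, 1, 0)
new_flag
# [1] 1 0 0 1
```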
In the Engine.volume column, we see that some cars also have their engine type written; that is, turbo or not turbo. We will create a new column indicating the engine type and then remove the "Turbo" part from this variable.
# grepl function to test whether a character is in a string
data$Turbo = grepl("Turbo", data$Engine.volume)
data$Turbo = ifelse(data$Turbo == TRUE, "Yes", "No")
for(i in 1:nrow(data)){
if(data[i, 20] == "Yes"){
data[i, 20] = 1
}else{
data[i, 20] = 0
}
}
# 0 -> no turbo
# 1 -> turbo
data$Turbo = as.numeric(data$Turbo)
# if TRUE, the car is Turbo Engine
sum(data$Turbo == 1) / 19237 # about 10% of the cars are turbo
## [1] 0.1002235
# now we can remove the Turbo part of Engine.volume and convert the variable to numeric
data$Engine.volume = gsub('Turbo', '', data$Engine.volume)
data$Engine.volume = as.numeric(data$Engine.volume)
Regarding the Doors column, we simplify its values: "04-May" just means 4 doors, and "02-Mar" means 2 doors (likely an artifact of spreadsheet date conversion). The remaining cars have more than 5 doors (for instance, a minibus).
# Doors column
unique(data$Doors)
## [1] "04-May" "02-Mar" ">5"
sum(is.na(data$Doors)) # check just in case
## [1] 0
# 04-may just means 4
# 02-mar just means 2
for (i in 1:nrow(data)){
if(data[i,15] == "04-May"){
data[i, 15] = 4
}else if(data[i,15] == "02-Mar"){
data[i,15] = 2
}
}
# it cannot be converted to a numeric variable because >5 is undefined
Handling missing values
# checking for missing data
sum(is.na(data))
## [1] 102
colSums(is.na(data))
## ID Price Levy Manufacturer
## 20 0 0 0
## Model Prod.year Category Leather.interior
## 0 10 0 0
## Fuel.type Engine.volume Mileage Cylinders
## 0 12 16 20
## Gear.box.type Drive.wheels Doors Wheel
## 0 0 0 0
## Color Airbags New Turbo
## 0 24 0 0
We can see that there are NAs in some numeric columns: ID, Prod.year, Engine.volume, Mileage, Cylinders and Airbags.
NAs in ID are not really significant, as ID is a column that doesn't provide relevant information for this analysis. In fact, I'm going to remove this column, not because it contains NAs but because it contributes nothing to the goal of the project: from my point of view, it carries no information about car prices or car pollution.
data = data[, -1]
NAs in Production year
We only have 10 cars with no information on their year of production. Thus, in this case we can handle each of them by assigning the average production year of the cars sharing the same model.
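As a sketch of this model-wise mean imputation (not the exact loop used below), base R's `ave()` expresses the same idea compactly on toy data:

```r
# Toy data: production years with NAs, grouped by car model
year  = c(2010, NA, 2012, 2008, NA)
model = c("FIT", "FIT", "FIT", "RX 450", "RX 450")
# Within each model, replace NAs by the rounded mean of the observed years
filled = ave(year, model, FUN = function(v) {
  v[is.na(v)] = round(mean(v, na.rm = TRUE))
  v
})
filled
# [1] 2010 2011 2012 2008 2008
```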
nas_year = which(is.na(data$Prod.year))
# returns the index of the row with the NA value in Prod.year
# for each of these cars, I will compute the avg Prod.year of all the cars
# in the dataset with the same model and assign that value to the NA
for(car in nas_year){
data$Prod.year[car] = round(mean(data$Prod.year[data$Model == data$Model[car]], na.rm = TRUE))
}
colSums(is.na(data))
## Price Levy Manufacturer Model
## 0 0 0 0
## Prod.year Category Leather.interior Fuel.type
## 0 0 0 0
## Engine.volume Mileage Cylinders Gear.box.type
## 12 16 20 0
## Drive.wheels Doors Wheel Color
## 0 0 0 0
## Airbags New Turbo
## 24 0 0
# successfully handled
NAs in Engine Volume
We have 12 missing values. The engine volume of a car refers to the total volume of fuel and air that can be pushed through the car's cylinders (recorded in litres in this data set). We can therefore estimate the NAs using the Cylinders column.
nas_eng = which(is.na(data$Engine.volume))
# for each NA, assign the average engine volume of cars with the same number of cylinders and the same model
for(car in nas_eng){
data$Engine.volume[car] = round(mean(data$Engine.volume[(data$Cylinders == data$Cylinders[car]) & (data$Model == data$Model[car])], na.rm = TRUE))
}
colSums(is.na(data))
## Price Levy Manufacturer Model
## 0 0 0 0
## Prod.year Category Leather.interior Fuel.type
## 0 0 0 0
## Engine.volume Mileage Cylinders Gear.box.type
## 1 16 20 0
## Drive.wheels Doors Wheel Color
## 0 0 0 0
## Airbags New Turbo
## 24 0 0
# all NAs have been successfully handled except for one
which(is.na(data$Engine.volume))
## [1] 9032
data$Cylinders[which(is.na(data$Engine.volume))]
## [1] NA
# apparently, this certain car has an NA in its cylinders, which is the reason why
# it hasn't been able to be completely handled
# for this particular case, I will estimate the NA considering only the model of the car
idx = which(is.na(data$Engine.volume))
data$Engine.volume[idx] = round(mean(data$Engine.volume[data$Model == data$Model[idx]], na.rm = TRUE))
colSums(is.na(data))
## Price Levy Manufacturer Model
## 0 0 0 0
## Prod.year Category Leather.interior Fuel.type
## 0 0 0 0
## Engine.volume Mileage Cylinders Gear.box.type
## 0 16 20 0
## Drive.wheels Doors Wheel Color
## 0 0 0 0
## Airbags New Turbo
## 24 0 0
# now it's ok
NAs in Mileage
We have 16 missing values. It is difficult to estimate the mileage of a car, so I will base the estimation on the average mileage of cars with the same production year.
nas_km = which(is.na(data$Mileage))
# returns the index of the row with the NA value in Mileage
# for each NA, assign the avg number of km of the cars with the same production year as the NA
for(car in nas_km){
data$Mileage[car] = round(mean(data$Mileage[data$Prod.year == data$Prod.year[car]], na.rm = TRUE))
}
colSums(is.na(data))
## Price Levy Manufacturer Model
## 0 0 0 0
## Prod.year Category Leather.interior Fuel.type
## 0 0 0 0
## Engine.volume Mileage Cylinders Gear.box.type
## 0 0 20 0
## Drive.wheels Doors Wheel Color
## 0 0 0 0
## Airbags New Turbo
## 24 0 0
# successfully handled
NAs in Cylinders
Cylinders are the power unit of an engine. We have 20 missing values which can be handled by using the engine column.
nas_cyl = which(is.na(data$Cylinders))
for(car in nas_cyl){
data$Cylinders[car] = round(mean(data$Cylinders[data$Engine.volume == data$Engine.volume[car]], na.rm = TRUE))
}
colSums(is.na(data))
## Price Levy Manufacturer Model
## 0 0 0 0
## Prod.year Category Leather.interior Fuel.type
## 0 0 0 0
## Engine.volume Mileage Cylinders Gear.box.type
## 0 0 0 0
## Drive.wheels Doors Wheel Color
## 0 0 0 0
## Airbags New Turbo
## 24 0 0
# successfully handled
NAs in Airbags
We have 24 missing values. The airbag, nowadays standard equipment, was patented by Mercedes-Benz in 1971, but it was not installed in a production car until 1981, and only in the 90s did it begin to spread across all kinds of makes and models. Since 2006, new cars are required in many markets to have at least two airbags in the front of the car.
We will take this regulation into account to handle cars with NAs or 0 airbags, as it makes no sense that some cars had airbags before they were even invented (this is likely a data error).
For all cars with Prod.year < 1981, we will set the number of airbags to 0
For all cars with 1981 < Prod.year < 2006, if there is a NA it will just be set to 0
For all cars with Prod.year > 2006, if there is a NA it will be set to 2 (mandatory)
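The three rules above can also be sketched with vectorized assignments on toy vectors; the loop below implements the same logic row by row:

```r
# Toy data: production years and airbag counts (NA = missing)
prod_year = c(1975, 1995, 2010, 2010)
airbags   = c(4,    NA,   NA,   6)
airbags[prod_year < 1981] = 0                   # before 1981: no airbags in production cars
airbags[is.na(airbags) & prod_year < 2006] = 0  # 1981-2006: missing -> assume none
airbags[is.na(airbags)] = 2                     # 2006 onwards: missing -> mandatory minimum of 2
airbags
# [1] 0 0 2 6
```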
for (i in 1:nrow(data)){
if(data[i,5] < 1981){ # Prod.year < 1981
data[i,17] = 0 # set airbags to 0
}else if((data[i, 5] < 2006) & is.na(data[i, 17])){ # Prod.year between 1981 and 2006 and NA in airbag
data[i, 17] = 0
}else if(is.na(data[i, 17])){ # Prod.year > 2006 and NA in airbag
data[i, 17] = 2
}
}
colSums(is.na(data))
## Price Levy Manufacturer Model
## 0 0 0 0
## Prod.year Category Leather.interior Fuel.type
## 0 0 0 0
## Engine.volume Mileage Cylinders Gear.box.type
## 0 0 0 0
## Drive.wheels Doors Wheel Color
## 0 0 0 0
## Airbags New Turbo
## 0 0 0
# successfully handled
Thus, for all cars with 0 airbags produced in 2006 or later, we will just set their number of airbags to 2 (the minimum they must have).
NAs in Levy
After analyzing the Levy column, I found that it does contain missing values, but they were encoded as '-' in the data, which is why we could not detect them earlier. We could impute '-' with 0, assuming there was no levy, or with the mean or median. However, we should not set these values to 0, as there is a mandatory third-party insurance ("Seguro obligatorio a terceros") equal to 87 (which is indeed the minimum levy among the remaining cars). Therefore, every '-' value is set to 87.
for (i in 1:nrow(data)){
if(data[i, 2] == "-"){
data[i, 2] = "87"
}
}
data$Levy = as.numeric(data$Levy)
Creating more variables
Now we will create the Pollution variable that will be later used for the classification part. This variable will be categorical, with values “HIGH”, “MEDIUM” or “LOW” depending on the pollution generated by the car.
To determine the group of each car, we will use a formula that generates a pollution index, and each group will correspond to an interval of this index. For it, we will also create a variable containing the fuel index of a car, based on the CO2 emissions of its fuel type.
After researching automobile sites, I came up with this formula:
Pollution index = Fuel index * ( Engine volume + Cylinders )
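For instance, for the first car of the sample (a Hybrid with a 3.5 L engine and 6 cylinders), the formula gives:

```r
# Fuel index for Hybrid = its CO2 emissions divided by diesel's (see table below)
hybrid_idx = round(0.7 / 2.4, 2)        # 0.29
pollution  = hybrid_idx * (3.5 + 6)     # Fuel index * (Engine volume + Cylinders)
pollution
# [1] 2.755
```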
unique(data$Fuel.type)
## [1] "Hybrid" "Petrol" "Diesel" "CNG"
## [5] "Plug-in Hybrid" "LPG" "Hydrogen"
The fuel coefficient is based on the CO2 emissions of each fuel type.
| Type of Fuel | CO2 emissions (in kg per liter) |
|---|---|
| Diesel | 2.4 |
| Petrol | 2.3 |
| LPG (Liquefied Petroleum Gas) | 1.51 |
| CNG (Compressed Natural Gas) | 1.39 |
| Hybrid | 0.7 |
| Plug-in-hybrid | 0.5 |
| Hydrogen | 0.1 |
Considering diesel as the most pollutant fuel, we can compute the indexes for the rest of the fuels relative to it and create a new column containing the index of each car.
# establishing the CO2 emissions
co2_emission_diesel = 2.4
co2_emission_petrol = 2.3
co2_emission_lpg = 1.51
co2_emission_cng = 1.39
co2_emission_hybrid = 0.7
co2_emission_plug_in = 0.5
co2_emission_hydrogen = 0.1
# computing the coefficients
diesel = round(co2_emission_diesel / co2_emission_diesel, 2)
petrol = round(co2_emission_petrol / co2_emission_diesel, 2)
lpg = round(co2_emission_lpg / co2_emission_diesel, 2)
cng = round(co2_emission_cng / co2_emission_diesel, 2)
hybrid = round(co2_emission_hybrid / co2_emission_diesel, 2)
plug_in = round(co2_emission_plug_in / co2_emission_diesel, 2)
hydrogen = round(co2_emission_hydrogen / co2_emission_diesel, 2)
data$Fuel.idx = 0
# column for determining the index of the fuel type by means of pollution
for(i in 1:nrow(data)){
if(data[i, 8] == "Diesel"){
data[i, 20] = diesel
}else if(data[i, 8] == "Petrol"){
data[i, 20] = petrol
}else if(data[i, 8] == "LPG"){
data[i, 20] = lpg
}else if(data[i, 8] == "CNG"){
data[i, 20] = cng
}else if(data[i, 8] == "Hybrid"){
data[i, 20] = hybrid
}else if(data[i, 8] == "Plug-in Hybrid"){
data[i, 20] = plug_in
}else{ # hydrogen
data[i, 20] = hydrogen
}
}
Now we create the Pollution variable. This column holds the result of the previous formula applied to each car, to classify vehicles by how pollutant they are.
data$Pollution = 0 # creating the column and initializing it to 0
# index of fuel * (Engine.volume + Cylinders)
for(i in 1:nrow(data)){
data[i, 21] = data[i, 20] * (data[i, 9] + data[i, 11])
}
summary(data$Pollution)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.336 4.032 5.760 5.689 6.300 18.720
Right now Pollution contains the result of the formula. Cars will now be classified as:
HIGH: pollution > 6.5
MEDIUM: 4 <= pollution <= 6.5
LOW: pollution < 4
for(i in 1:nrow(data)){
p = data[i, 20] * (data[i, 9] + data[i, 11])
if(p > 6.5){ # HIGH
data[i, 21] = "HIGH"
}else if(p < 4){ # LOW
data[i, 21] = "LOW"
}else{ # MEDIUM
data[i, 21] = "MEDIUM"
}
}
Price will be the target column/dependent feature for the regression part of the project.
Pollution will be the target column for the classification part.
Outliers
Studying the outliers of the numeric variables by IQR:
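Since the same 1.5*IQR rule is applied to several columns below, it can be factored into a small helper (a sketch with a hypothetical name, `iqr_outliers`; the report applies the rule inline instead):

```r
# Indices of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
iqr_outliers = function(x) {
  q1  = quantile(x, 0.25, names = FALSE)
  q3  = quantile(x, 0.75, names = FALSE)
  iqr = q3 - q1
  which(x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr)
}
iqr_outliers(c(1, 2, 3, 4, 100))
# [1] 5
```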
Outliers in Price
QI = quantile(data$Price, 0.25)
QS = quantile(data$Price, 0.75)
IQR = QS-QI
names(QI) = NULL
names(QS) = NULL
sum(data$Price < QI - 1.5*IQR | data$Price > QS + 1.5*IQR)
## [1] 1073
ggplot(data = data, mapping = aes(y = Price)) +
geom_boxplot(outlier.color = "red", outlier.shape = 4) +
ggtitle(expression(atop("Boxplot of Price", atop(italic("In Red Outliers"))))) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
outliers = c()
data$Price = as.numeric(data$Price)
for(i in 1:nrow(data)){
if(data[i, 1] < (QI - 1.5*IQR) | data[i, 1] > (QS + 1.5*IQR)){
outliers = c(outliers, i)
}
}
a = head(which(data$Price < (QI - 1.5*IQR) | data$Price > (QS + 1.5*IQR)))
prices = c()
for (i in a){
prices = c(prices, data[i, 1])
}
prices
## [1] 59464 51746 55390 87112 53154 77775
As we can see, the outliers of the Price variable are very large values. The data also contains very small prices, which at first may seem to make little sense. However, this is possible: in some cases it is cheaper to sell a car (usually an old one) at a very low price than to take it to the scrapyard, and these may be cars that do not run well and need repairs. We should also remember that this data set is mainly of used cars. For these reasons, low-priced cars make sense, and indeed such cars tend to have very high mileage, which supports this idea.
However, to make the comparison and analysis of prices reasonable, I will remove all cars that are not from the 21st century. From my point of view, it doesn't make sense to compare prices over such a large time interval (1939-2020), because of factors such as inflation and currency changes.
not21 = c() # vector containing the positions of all cars that are not from the 21st century
for(i in 1:nrow(data)){
if(data[i, 5] < 2000){
not21 = c(not21, i)
}
}
length(not21) # 982 cars, about 5% of the data
## [1] 982
data = data[-not21, ]
This change doesn't substantially affect our data, as cars from those years make up only about 5% of it. Now that we only consider cars of the 21st century, we can safely compute the outliers in Price.
QI = quantile(data$Price, 0.25)
QS = quantile(data$Price, 0.75)
IQR = QS-QI
names(QI) = NULL
names(QS) = NULL
sum(data$Price < QI - 1.5*IQR | data$Price > QS + 1.5*IQR)
## [1] 1014
ggplot(data = data, mapping = aes(y = Price)) +
geom_boxplot(outlier.color = "red", outlier.shape = 4) +
ggtitle(expression(atop("Boxplot of Price", atop(italic("In Red Outliers"))))) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
From the boxplot, we see that many cars have a low price, but this is certainly possible, as our data set is not just of new cars and covers a wide range of car categories.
Let us remove the outliers.
outliers = c()
for(i in 1:nrow(data)){
if(data[i, 1] < (QI - 1.5*IQR) | data[i, 1] > (QS + 1.5*IQR)){
outliers = c(outliers, i)
}
}
data = data[-outliers, ]
Outliers in Levy
QI = quantile(data$Levy, 0.25)
QS = quantile(data$Levy, 0.75)
IQR = QS-QI
sum(data$Levy < QI - 1.5*IQR | data$Levy > QS + 1.5*IQR)
## [1] 189
ggplot(data = data, mapping = aes(y = Levy)) +
geom_boxplot(fill = "slategray1", outlier.color = "red", outlier.shape = 4) +
ggtitle(expression(atop("Boxplot of Levy", atop(italic("In Red Outliers"))))) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
We can see that there are several outliers that we can safely remove.
outliers = c()
for(i in 1:nrow(data)){
if(data[i, 2] < (QI - 1.5*IQR) | data[i, 2] > (QS + 1.5*IQR)){
outliers = c(outliers, i)
}
}
data = data[-outliers, ]
Outliers in Year of Production
ggplot(data = data, mapping = aes(y = Prod.year)) +
geom_boxplot(fill = "darkolivegreen1", outlier.color = "red", outlier.shape = 4) +
ggtitle(expression(atop("Boxplot of Production year", atop(italic("In Red Outliers"))))) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
For this variable, there are several outliers, but it is not necessary to remove them, because we want to consider all cars from the 21st century. We may have more cars from 2010 onwards, but that doesn't mean we want to discard those from years such as 2000; they are still part of our analysis.
Outliers in Engine Volume
QI = quantile(data$Engine.volume, 0.25)
QS = quantile(data$Engine.volume, 0.75)
IQR = QS-QI
sum(data$Engine.volume < QI - 1.5*IQR | data$Engine.volume > QS + 1.5*IQR)
## [1] 987
ggplot(data = data, mapping = aes(y = Engine.volume)) +
geom_boxplot(fill = "thistle1", outlier.color = "red", outlier.shape = 4) +
ggtitle(expression(atop("Boxplot of Engine Volume", atop(italic("In Red Outliers"))))) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
From the boxplot we see several outliers, which may influence the classification or prediction parts of our analysis, as this variable seems relevant for a car (judging by its definition; we will see later whether this holds).
outliers = c()
for(i in 1:nrow(data)){
if(data[i, 9] < (QI - 1.5*IQR) | data[i, 9] > (QS + 1.5*IQR)){
outliers = c(outliers, i)
}
}
data = data[-outliers, ]
Outliers in Mileage
QI = quantile(data$Mileage, 0.25)
QS = quantile(data$Mileage, 0.75)
IQR = QS-QI
sum(data$Mileage < QI - 1.5*IQR | data$Mileage > QS + 1.5*IQR)
## [1] 623
ggplot(data = data, mapping = aes(y = Mileage)) +
geom_boxplot(outlier.color = "red", outlier.shape = 4) +
ggtitle(expression(atop("Boxplot of Mileage", atop(italic("In Red Outliers"))))) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
From the boxplot, we can observe some outliers. We should take into account that this variable is very subjective, as it depends on the owner of the car and their habits (someone who loves to travel will use the car frequently). Therefore, we keep the outliers: the characteristics of these high-mileage cars, and how they affect the Price variable, may help the analysis.
Outliers in Cylinders
Let us compute them by the 3-sigma rule.
mu = mean(data$Cylinders)
sigma = sd(data$Cylinders)
sum(data$Cylinders < mu - 3*sigma | data$Cylinders > mu + 3*sigma)
## [1] 102
Cylinders’ outliers will not be removed, as we have already removed the Engine Volume outliers, a variable related to Cylinders (the engine volume of a car depends on its cylinders). At the same time, it may be interesting to see how the price and pollution of a car are affected by these outliers; indeed, they may contribute to a better analysis.
Outliers in Airbags
QI = quantile(data$Airbags, 0.25)
QS = quantile(data$Airbags, 0.75)
IQR = QS-QI
sum(data$Airbags < QI - 1.5*IQR | data$Airbags > QS + 1.5*IQR)
## [1] 0
ggplot(data = data, mapping = aes(y = Airbags)) +
geom_boxplot(fill = "mediumslateblue", outlier.color = "thistle", outlier.shape = 4) +
ggtitle(expression(atop("Boxplot of Airbags", atop(italic("In Red Outliers"))))) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
From the plot, we see that there are no outliers in the Airbags variable.
New and Turbo are also numeric variables, but they were derived from Mileage and Engine Volume and actually represent logical values.
Outliers in Fuel index
QI = quantile(data$Fuel.idx, 0.25)
QS = quantile(data$Fuel.idx, 0.75)
IQR = QS-QI
sum(data$Fuel.idx < QI - 1.5*IQR | data$Fuel.idx > QS + 1.5*IQR)
## [1] 1
ggplot(data = data, mapping = aes(y = Fuel.idx)) +
geom_boxplot(fill = "plum2", outlier.color = "red", outlier.shape = 4) +
ggtitle(expression(atop("Boxplot of Fuel idx", atop(italic("In Red Outliers"))))) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
We have 1 outlier that can be safely removed from the data.
outlier = which(data$Fuel.idx < QI - 1.5*IQR | data$Fuel.idx > QS + 1.5*IQR)
data = data[-outlier, ]
Visualization
In this part we consider general plots to get an insight of the data set. We also try to start looking at relations with our target variables.
plot_ly(data, labels = ~Color, type = 'pie') %>%
layout(title = 'Color of Cars',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
This plot shows the proportions of the colors of the cars on sale. Black, white and silver are the three most common colors. Interestingly, 6.93% of the cars are purple, which I would usually consider an unusual color for a car.
ggplot(data = data) +
geom_bar(mapping = aes(x = Category, fill = Fuel.type))
In this graph we see the number of cars on sale per category, with each bar divided by the proportion of fuel types used within that category. The most common category is Sedan, which makes sense, as a sedan is the typical passenger car (defined as a 4-door car with a trunk separate from the passenger compartment). Among all cars, Petrol seems to be the most common fuel type, followed by Diesel and Hybrid.
ggplot(data = data, aes(x = reorder(Manufacturer, Price),y = Price)) +
geom_bar(stat ='identity',aes(fill = Price))+
coord_flip() +
theme_grey() +
scale_fill_gradient(name="Price Level")+
labs(title = 'Ranking of Manufacturers by Price',
y='Manufacturer', x='Price')+
geom_hline(yintercept = mean(data$Price),size = 1, color = 'blue')
Not many cars have a very high price, but from the plot we see that "Hyundai" and "Toyota" are the two companies with the highest prices for their cars in this data set, while many others sit at commonly cheap prices.
ggplot(data, aes(Engine.volume)) + geom_density(aes(fill=factor(Cylinders)), alpha=0.8) +
labs(title="Density plot",
subtitle="Engine volume grouped by number of cylinders",
x="Engine Volume",
fill="Cylinders")
At first, one may think the engine volume should depend on the number of cylinders. Cars with more cylinders, which are the most powerful, try to optimize the engine volume so that more cylinders fit in a reduced space, which also means less weight (but usually a higher price).
We will later see whether this idea holds. But cars with more cylinders in a small engine (low engine volume) will hypothetically cost more than the others; sports cars, for instance.
ggplot(data, aes(Gear.box.type, Prod.year)) + geom_boxplot(aes(fill=factor(Pollution))) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title = "Box plot",
subtitle = "Production year grouped by Type of transmission",
x = "Type of transmission",
y= "Production Year",
fill = "Pollution")From this plot we see that the automatic cars are the most recent type of cars and the manual cars the most antique.
Also, the cars with higher pollution were produced approximately between 2000 and 2010, and tiptronic cars are the ones that contribute most to pollution. We can see how for each transmission type, the cars that are more pollutant were produced earlier than the ones that are less pollutant, which makes sense as during the years, governments have been adopting measurements to stop pollution
ggplot(data, aes(x = Airbags)) +
geom_histogram(fill="darkseagreen2", col="darkslateblue", binwidth = 2)
The histogram above shows how many cars have each number of airbags. As can be observed, most vehicles have around 4 airbags; many others have around 12.
ggplot(data, aes(x = Engine.volume, y = Levy, col = Cylinders)) +
geom_point(size = 2, shape = 1, alpha = 0.4) +
labs(x = "Engine Volume",y = "Levy", color = "Cylinders")
Comparing the influence of engine volume on the Levy, we notice a positive tendency: the levy paid for a car (insurance) grows with engine volume. This is probably because bigger engines mean better (and more expensive) cars, for which insurance companies usually charge a greater amount.
ggplot(data) + geom_point(aes(x = Price, y = Prod.year, color = Gear.box.type),
size=1)Indeed, most of the cars are automatic, mainly from 2005 to 2020. In the earliest 20’s there were more manual cars and the price was smaller. As the production year increases, the price increases as well, although there are still many different prices as not all the cars are new nor have the best conditions or characteristics.
ggplot(data, aes(x=Category, y= Gear.box.type, color = Fuel.type)) +
geom_jitter(size = 2, width = 0.1, alpha = 0.4) +
labs(x = "Category",y = "Type of transmission", color = "Type of Fuel")Regarding the different transmission types that we consider in this data set, we can see how each category of cars use a different type of fuel from the plot. The three colors that appear most are the ones corresponding to petrol, hybrid and diesel. Diesel is mainly in the manual cars, and appears most in “Goods wagon” and “Microbus”. Most of the cars use petrol or are hybrids though petrol cars are mainly noticed in manual and tiptronic cars, where “Sedan” and “Jeep” have the greatest number of petrol cars. Hybrid cars are mainly of automatic or variator transmission, and are mainly found in “Hatchback” or “Sedan”.
Let us now make some plots involving the main variables of the project (Pollution and Price).
ggplot(data = data) +
geom_smooth(mapping = aes(x = Prod.year, y = Price), color = 'red')In this plot we can see that, as the production year increases, the price of a car increases as well, probably partly reflecting changes in currency value over the years. However, from approximately 2017 the price starts decreasing. This may be due, for instance, to the new renting options that have appeared, reducing the number of cars sold and pushing sellers to lower their prices in order to actually make a profit.
ggplot(data, aes(x=as.factor(Drive.wheels), y=Price)) + geom_boxplot(fill="cornflowerblue") +
ggtitle("Total Price vs Drive wheels")With this plot we observe that cars being sold are usually 4x4, and that the mean price of the three types is slightly similar. We can also notice that the amount of cars of “Front” wheel type is smaller than the amount of the other two (of course, being the 4x4 type the one with more cars). But in general, this distribution is pretty similar for the three types. The most noticeable difference is the amount of cars of each type.
Classification
Let’s start with the classification part, where we will try to predict a target class: Pollution. The algorithm must identify which class each observation belongs to.
Preparing data
data2 = data # copy of the data where the variable types will be changed
# first character to factor,
# then factor to numeric
data2$Manufacturer = as.factor(data2$Manufacturer)
data2$Manufacturer = as.numeric(data2$Manufacturer)
data2$Model = as.factor(data2$Model)
data2$Model = as.numeric(data2$Model)
data2$Category = as.factor(data2$Category)
data2$Category = as.numeric(data2$Category)
data2$Leather.interior = as.factor(data2$Leather.interior)
data2$Leather.interior = as.numeric(data2$Leather.interior)
data2$Fuel.type = as.factor(data2$Fuel.type)
data2$Fuel.type = as.numeric(data2$Fuel.type)
data2$Gear.box.type = as.factor(data2$Gear.box.type)
data2$Gear.box.type = as.numeric(data2$Gear.box.type)
data2$Drive.wheels = as.factor(data2$Drive.wheels)
data2$Drive.wheels = as.numeric(data2$Drive.wheels)
data2$Doors = as.factor(data2$Doors)
data2$Doors = as.numeric(data2$Doors)
data2$Wheel = as.factor(data2$Wheel)
data2$Wheel = as.numeric(data2$Wheel)
data2$Color = as.factor(data2$Color)
data2$Color = as.numeric(data2$Color)
data2$Pollution= as.factor(data2$Pollution)
data2$Pollution = as.numeric(data2$Pollution)We convert each character variable to a factor and then to its numeric codes. In particular the Pollution variable which, as mentioned before, represents how pollutant a car is (“HIGH”, “MEDIUM” or “LOW”); below we map its codes back to those labels and count the cars in each category.
data2$Pollution[data2$Pollution == 2] <- 'LOW'
data2$Pollution[data2$Pollution == 1] <- 'HIGH'
data2$Pollution[data2$Pollution == 3] <- 'MEDIUM'
data2$Pollution=as.factor(data2$Pollution)
levels(data2$Pollution)## [1] "HIGH" "LOW" "MEDIUM"
table(data2$Pollution)##
## HIGH LOW MEDIUM
## 2472 4494 9098
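The repetitive as.factor/as.numeric pairs above can be written more compactly with dplyr's across(); a minimal sketch, assuming every character column (including Pollution) should be recoded the same way:

```r
library(dplyr)

# recode every character column to integer factor codes in one step;
# note this also recodes Pollution, which is then relabelled as above
data2 <- data %>%
  mutate(across(where(is.character), ~ as.numeric(as.factor(.x))))
```

This reproduces the manual conversions provided the listed columns are stored as character; columns that are already numeric (Price, Mileage, …) are left untouched.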
Now we create the data partition.
in_train = createDataPartition(data2$Pollution, p = 0.8, list = FALSE)
train = data2[in_train,]
test = data2[-in_train,]
nrow(train)## [1] 12853
nrow(test)## [1] 3211
table(train$Pollution)/length(train$Pollution)##
## HIGH LOW MEDIUM
## 0.1538940 0.2797790 0.5663269
Correlated data: (a matrix displaying the correlation between every pair of variables in the data, making patterns easier to identify and visualize)
# correlation
ggcorr(train[,-21], label = T)The variables are not highly correlated in this data set, so the accuracies of the models may not be that high.
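Had some predictors turned out strongly correlated, caret can flag them for removal before modelling; a sketch, under the assumption that we would drop pairs above |r| = 0.9:

```r
library(caret)

num_cols  <- train[, sapply(train, is.numeric)]       # numeric predictors only
corr_mat  <- cor(num_cols)
high_corr <- findCorrelation(corr_mat, cutoff = 0.9)  # column indices to drop
colnames(num_cols)[high_corr]                         # names of flagged predictors
```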
Bayes Classifiers
Benchmark
ctrl <- trainControl(method = "cv", number = 5,
classProbs = TRUE,
verboseIter=T)
# We have many predictors, hence use penalized logistic regression
lrFit <- train(Pollution ~ .,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, 0.1), lambda = seq(0, .1, 0.02)),
data = train,
preProcess = c("center", "scale"),
trControl = ctrl)## + Fold1: alpha=0.0, lambda=0.1
## - Fold1: alpha=0.0, lambda=0.1
## + Fold1: alpha=0.1, lambda=0.1
## - Fold1: alpha=0.1, lambda=0.1
## + Fold1: alpha=0.2, lambda=0.1
## - Fold1: alpha=0.2, lambda=0.1
## + Fold1: alpha=0.3, lambda=0.1
## - Fold1: alpha=0.3, lambda=0.1
## + Fold1: alpha=0.4, lambda=0.1
## - Fold1: alpha=0.4, lambda=0.1
## + Fold1: alpha=0.5, lambda=0.1
## - Fold1: alpha=0.5, lambda=0.1
## + Fold1: alpha=0.6, lambda=0.1
## - Fold1: alpha=0.6, lambda=0.1
## + Fold1: alpha=0.7, lambda=0.1
## - Fold1: alpha=0.7, lambda=0.1
## + Fold1: alpha=0.8, lambda=0.1
## - Fold1: alpha=0.8, lambda=0.1
## + Fold1: alpha=0.9, lambda=0.1
## - Fold1: alpha=0.9, lambda=0.1
## + Fold1: alpha=1.0, lambda=0.1
## - Fold1: alpha=1.0, lambda=0.1
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0 on full training set
# print(lrFit)
lrPred = predict(lrFit, test)
confusionMatrix(lrPred, test$Pollution)## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW MEDIUM
## HIGH 492 0 0
## LOW 0 893 7
## MEDIUM 2 5 1812
##
## Overall Statistics
##
## Accuracy : 0.9956
## 95% CI : (0.9927, 0.9976)
## No Information Rate : 0.5665
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9924
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: HIGH Class: LOW Class: MEDIUM
## Sensitivity 0.9960 0.9944 0.9962
## Specificity 1.0000 0.9970 0.9950
## Pos Pred Value 1.0000 0.9922 0.9962
## Neg Pred Value 0.9993 0.9978 0.9950
## Prevalence 0.1538 0.2797 0.5665
## Detection Rate 0.1532 0.2781 0.5643
## Detection Prevalence 0.1532 0.2803 0.5665
## Balanced Accuracy 0.9980 0.9957 0.9956
table(train$Pollution)##
## HIGH LOW MEDIUM
## 1978 3596 7279
maximum=max(table(test$Pollution))
accuracy=maximum/nrow(test)
accuracy## [1] 0.5664902
This majority-class predictor is trivially simple, so we discard it as a model and keep it only as a baseline: we should consider as good only accuracies higher than this one (about 0.566).
LDA
lda.model= lda(Pollution ~ ., data=train, prior=c(0.15,0.28, 0.57))
lda.model## Call:
## lda(Pollution ~ ., data = train, prior = c(0.15, 0.28, 0.57))
##
## Prior probabilities of groups:
## HIGH LOW MEDIUM
## 0.15 0.28 0.57
##
## Group means:
## Price Levy Manufacturer Model Prod.year Category
## HIGH 13768.90 777.7204 25.84075 634.4075 2010.113 7.491405
## LOW 10774.04 578.9163 35.92047 708.2839 2012.139 7.353448
## MEDIUM 16830.01 617.4182 28.30870 632.8404 2011.770 7.360214
## Leather.interior Fuel.type Engine.volume Mileage Cylinders
## HIGH 1.879171 3.791709 3.052932 753029.7 5.931244
## LOW 1.701335 3.178254 2.033259 463084.2 4.174360
## MEDIUM 1.707515 4.111279 1.953716 1100495.2 4.038879
## Gear.box.type Drive.wheels Doors Wheel Color Airbags
## HIGH 1.665319 1.824065 2.921638 1.038423 7.443377 8.458544
## LOW 1.465517 1.905451 2.989155 1.089544 9.695495 7.117631
## MEDIUM 1.503778 1.952466 2.953565 1.087512 9.053579 6.145899
## New Turbo Fuel.idx
## HIGH 0.03791709 0.14307381 0.9757179
## LOW 0.03670745 0.00945495 0.3620189
## MEDIUM 0.02623987 0.12721528 0.9617338
##
## Coefficients of linear discriminants:
## LD1 LD2
## Price -2.046378e-06 1.461952e-05
## Levy 3.038647e-04 1.698808e-04
## Manufacturer -2.426120e-03 2.563873e-03
## Model 6.036631e-04 -9.761718e-05
## Prod.year 1.152644e-02 -2.596967e-02
## Category 5.550186e-02 8.151887e-03
## Leather.interior 9.386334e-02 5.855300e-02
## Fuel.type -8.498417e-02 2.079135e-02
## Engine.volume -4.336888e-01 -9.498375e-01
## Mileage -2.112005e-10 8.582862e-10
## Cylinders -8.433674e-01 -1.331869e+00
## Gear.box.type -4.398902e-02 -8.278581e-02
## Drive.wheels -6.064327e-02 -2.495884e-01
## Doors -1.238897e-01 5.057412e-03
## Wheel 1.236740e-01 -1.216467e-01
## Color 1.282027e-02 6.087599e-03
## Airbags -1.905964e-02 2.692227e-02
## New -1.212960e-01 1.617713e-01
## Turbo 5.403148e-02 -8.818177e-02
## Fuel.idx -1.182585e+01 1.671821e+00
##
## Proportion of trace:
## LD1 LD2
## 0.8896 0.1104
The prior probability of the MEDIUM class is greater than those of the HIGH and LOW classes. We can see two linear discriminants since there are 3 groups (at most K − 1 discriminants for K classes).
prediction = predict(lda.model, newdata=test)$class
head(prediction)## [1] MEDIUM MEDIUM HIGH LOW MEDIUM LOW
## Levels: HIGH LOW MEDIUM
# performance by confusion matrix
# predictions in rows, true values in columns
confusionMatrix(prediction, test$Pollution)$table## Reference
## Prediction HIGH LOW MEDIUM
## HIGH 460 0 15
## LOW 0 874 6
## MEDIUM 34 24 1798
confusionMatrix(prediction, test$Pollution)$overall[1]## Accuracy
## 0.9753971
Indeed, the accuracy, nearly 97.5%, is well above the benchmark. This result is probably due to the fact that the target variable was created in the pre-processing part of the project as a combination of other variables, reflecting the influence of each car on CO2 emissions.
LDA uses the means and variances of each class to create a linear boundary (or separation) between them, delimited by the coefficients. The output first shows the prior probabilities of the groups, the class proportions assumed for the training data (here supplied manually). Then come the group means, the average of each predictor within each class. Finally, the coefficients of the linear discriminants (LD1 and LD2) are displayed: they define the linear combinations of the features that best discriminate among the classes.
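The discriminant scores themselves can be inspected directly: predict.lda() returns the projections of each observation onto LD1 and LD2. A short sketch using the model fitted above:

```r
# project the test set onto the two linear discriminants
proj <- predict(lda.model, newdata = test)$x
head(proj)   # one LD1/LD2 score pair per car

# visualize class separation in the discriminant space
plot(proj, col = as.numeric(test$Pollution), pch = 19, cex = 0.5)
```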
model <- lda(Pollution ~ ., data=train, prior=c(0.15,0.28, 0.57))
probability = predict(model, test)$posterior
roc.lda <- roc(test$Pollution,probability[,3])
auc(roc.lda) ## Area under the curve: 0.8388
plot.roc(test$Pollution, probability[,3],col="deeppink", print.auc = TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col="lightblue", print.thres=TRUE)After plotting the ROC curve, we see the illustration of the the sensitivity and specificity of each of the cut points of the test. The AUC (Area Under the Curve) is relatively high, so we can see that it is a good model to predict in our data base.
QDA
qda.model = qda(Pollution ~ ., data = train, prior=c(0.15,0.28, 0.57))
qda.model## Call:
## qda(Pollution ~ ., data = train, prior = c(0.15, 0.28, 0.57))
##
## Prior probabilities of groups:
## HIGH LOW MEDIUM
## 0.15 0.28 0.57
##
## Group means:
## Price Levy Manufacturer Model Prod.year Category
## HIGH 13768.90 777.7204 25.84075 634.4075 2010.113 7.491405
## LOW 10774.04 578.9163 35.92047 708.2839 2012.139 7.353448
## MEDIUM 16830.01 617.4182 28.30870 632.8404 2011.770 7.360214
## Leather.interior Fuel.type Engine.volume Mileage Cylinders
## HIGH 1.879171 3.791709 3.052932 753029.7 5.931244
## LOW 1.701335 3.178254 2.033259 463084.2 4.174360
## MEDIUM 1.707515 4.111279 1.953716 1100495.2 4.038879
## Gear.box.type Drive.wheels Doors Wheel Color Airbags
## HIGH 1.665319 1.824065 2.921638 1.038423 7.443377 8.458544
## LOW 1.465517 1.905451 2.989155 1.089544 9.695495 7.117631
## MEDIUM 1.503778 1.952466 2.953565 1.087512 9.053579 6.145899
## New Turbo Fuel.idx
## HIGH 0.03791709 0.14307381 0.9757179
## LOW 0.03670745 0.00945495 0.3620189
## MEDIUM 0.02623987 0.12721528 0.9617338
prediction = predict(qda.model, newdata=test)$class
head(prediction)## [1] MEDIUM MEDIUM HIGH LOW MEDIUM LOW
## Levels: HIGH LOW MEDIUM
confusionMatrix(prediction, test$Pollution)$table## Reference
## Prediction HIGH LOW MEDIUM
## HIGH 483 6 35
## LOW 0 887 30
## MEDIUM 11 5 1754
confusionMatrix(prediction, test$Pollution)$overall[1]## Accuracy
## 0.9729056
model <- qda(Pollution ~ ., data=train, prior=c(0.15,0.28, 0.57))We reach broadly the same conclusions as with the previous analysis, but with slightly lower accuracy. As with LDA, the observations of each class are assumed to be drawn from a normal distribution; the main difference is that QDA assumes each class has its own covariance matrix, which increases the number of parameters to estimate.
probability = predict(model, test)$posterior
roc.lda <- roc(test$Pollution,probability[,3])
auc(roc.lda) ## Area under the curve: 0.8086
plot.roc(test$Pollution, probability[,3],col="deeppink", print.auc = TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col="lightblue", print.thres=TRUE)The AUC is a bit lower in this descriptive analysis and we can see that there is no huge difference between the data set and the subset.
Decision Trees
# Hyper-parameters
control = rpart.control(minsplit = 30, maxdepth = 10, cp=0.01)
# minsplit: minimum number of observations in a node before before a split
# maxdepth: maximum depth of any node of the final tree
# cp: complexity parameter; the smaller, the more branches
model = Pollution ~ .
dtFit <- rpart(model, data=train, method = "class", control = control)
summary(dtFit)## Call:
## rpart(formula = model, data = train, method = "class", control = control)
## n= 12853
##
## CP nsplit rel error xerror xstd
## 1 0.60064586 0 1.00000000 1.00000000 0.010079758
## 2 0.33548619 1 0.39935414 0.39935414 0.007696596
## 3 0.01166128 2 0.06386796 0.06386796 0.003337787
## 4 0.01000000 4 0.04054539 0.03946896 0.002638125
##
## Variable importance
## Fuel.idx Cylinders Engine.volume Fuel.type Manufacturer
## 31 24 18 8 7
## Mileage Levy Model Drive.wheels Gear.box.type
## 3 2 2 2 2
## Prod.year
## 1
##
## Node number 1: 12853 observations, complexity param=0.6006459
## predicted class=MEDIUM expected loss=0.4336731 P(node) =1
## class counts: 1978 3596 7279
## probabilities: 0.154 0.280 0.566
## left son=2 (3724 obs) right son=3 (9129 obs)
## Primary splits:
## Fuel.idx < 0.795 to the left, improve=3869.8470, (0 missing)
## Cylinders < 4.5 to the right, improve=1952.6340, (0 missing)
## Engine.volume < 2.55 to the right, improve=1609.0710, (0 missing)
## Fuel.type < 4.5 to the left, improve=1553.7830, (0 missing)
## Manufacturer < 50.5 to the right, improve= 498.9054, (0 missing)
## Surrogate splits:
## Fuel.type < 4.5 to the left, agree=0.777, adj=0.230, (0 split)
## Manufacturer < 50.5 to the right, agree=0.738, adj=0.095, (0 split)
## Mileage < 310032 to the right, agree=0.735, adj=0.086, (0 split)
## Gear.box.type < 3.5 to the right, agree=0.723, adj=0.044, (0 split)
##
## Node number 2: 3724 observations, complexity param=0.01166128
## predicted class=LOW expected loss=0.05075188 P(node) =0.2897378
## class counts: 2 3535 187
## probabilities: 0.001 0.949 0.050
## left son=4 (3415 obs) right son=5 (309 obs)
## Primary splits:
## Prod.year < 2006.5 to the right, improve=124.66870, (0 missing)
## Fuel.type < 2 to the right, improve= 72.22757, (0 missing)
## Cylinders < 4.5 to the left, improve= 70.35871, (0 missing)
## Fuel.idx < 0.435 to the left, improve= 60.80848, (0 missing)
## Engine.volume < 2.55 to the left, improve= 56.22645, (0 missing)
## Surrogate splits:
## Fuel.type < 2 to the right, agree=0.963, adj=0.553, (0 split)
## Drive.wheels < 2.5 to the left, agree=0.924, adj=0.087, (0 split)
## Manufacturer < 5.5 to the right, agree=0.923, adj=0.071, (0 split)
## Model < 1152 to the left, agree=0.921, adj=0.049, (0 split)
## Mileage < 2564160 to the left, agree=0.918, adj=0.006, (0 split)
##
## Node number 3: 9129 observations, complexity param=0.3354862
## predicted class=MEDIUM expected loss=0.2231351 P(node) =0.7102622
## class counts: 1976 61 7092
## probabilities: 0.216 0.007 0.777
## left son=6 (1876 obs) right son=7 (7253 obs)
## Primary splits:
## Cylinders < 4.5 to the right, improve=2863.0490, (0 missing)
## Engine.volume < 2.55 to the right, improve=2331.7610, (0 missing)
## Drive.wheels < 1.5 to the left, improve= 486.8362, (0 missing)
## Levy < 913 to the right, improve= 484.9521, (0 missing)
## Manufacturer < 4.5 to the left, improve= 421.9788, (0 missing)
## Surrogate splits:
## Engine.volume < 2.75 to the right, agree=0.948, adj=0.746, (0 split)
## Manufacturer < 4.5 to the left, agree=0.829, adj=0.169, (0 split)
## Levy < 1381 to the right, agree=0.814, adj=0.093, (0 split)
## Model < 1172.5 to the right, agree=0.812, adj=0.087, (0 split)
## Drive.wheels < 1.5 to the left, agree=0.810, adj=0.076, (0 split)
##
## Node number 4: 3415 observations
## predicted class=LOW expected loss=0.01171303 P(node) =0.2656967
## class counts: 1 3375 39
## probabilities: 0.000 0.988 0.011
##
## Node number 5: 309 observations, complexity param=0.01166128
## predicted class=LOW expected loss=0.4822006 P(node) =0.02404108
## class counts: 1 160 148
## probabilities: 0.003 0.518 0.479
## left son=10 (174 obs) right son=11 (135 obs)
## Primary splits:
## Cylinders < 4.5 to the left, improve=120.30840, (0 missing)
## Engine.volume < 2.45 to the left, improve=101.43390, (0 missing)
## Price < 11552 to the left, improve= 29.50067, (0 missing)
## Leather.interior < 1.5 to the left, improve= 24.67542, (0 missing)
## Manufacturer < 37.5 to the right, improve= 22.60961, (0 missing)
## Surrogate splits:
## Engine.volume < 2.45 to the left, agree=0.926, adj=0.830, (0 split)
## Leather.interior < 1.5 to the left, agree=0.715, adj=0.348, (0 split)
## Price < 10819.5 to the left, agree=0.712, adj=0.341, (0 split)
## Airbags < 6.5 to the left, agree=0.683, adj=0.274, (0 split)
## Gear.box.type < 2.5 to the left, agree=0.670, adj=0.244, (0 split)
##
## Node number 6: 1876 observations
## predicted class=HIGH expected loss=0.001599147 P(node) =0.1459581
## class counts: 1873 0 3
## probabilities: 0.998 0.000 0.002
##
## Node number 7: 7253 observations
## predicted class=MEDIUM expected loss=0.02261133 P(node) =0.5643041
## class counts: 103 61 7089
## probabilities: 0.014 0.008 0.977
##
## Node number 10: 174 observations
## predicted class=LOW expected loss=0.09195402 P(node) =0.0135377
## class counts: 0 158 16
## probabilities: 0.000 0.908 0.092
##
## Node number 11: 135 observations
## predicted class=MEDIUM expected loss=0.02222222 P(node) =0.01050338
## class counts: 1 2 132
## probabilities: 0.007 0.015 0.978
rpart.plot(dtFit, digits=3)We start at the root node (top of the graph).
At the top, we can see the overall proportions of each pollution category: roughly 15 percent HIGH, 28 percent LOW and 57 percent MEDIUM. The root node asks whether the fuel index is below 0.795 and sends each car to the left or right child node (depth 2) accordingly. Following the path determined by a car’s characteristics, we eventually reach a leaf node, which gives the predicted pollution class.
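To verify this path reading, we can push a single observation through the fitted tree; a sketch using the first test car:

```r
one_car <- test[1, ]
predict(dtFit, newdata = one_car, type = "class")  # predicted pollution class
predict(dtFit, newdata = one_car, type = "prob")   # class proportions in its leaf
```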
control = rpart.control(minsplit = 40, maxdepth = 12, cp=0.001)
dtFit <- rpart(model, data=train, method = "class", control = control)
summary(dtFit)## Call:
## rpart(formula = model, data = train, method = "class", control = control)
## n= 12853
##
## CP nsplit rel error xerror xstd
## 1 0.600645856 0 1.000000000 1.000000000 0.010079758
## 2 0.335486186 1 0.399354144 0.399354144 0.007696596
## 3 0.011661285 2 0.063867958 0.063867958 0.003337787
## 4 0.009508432 4 0.040545389 0.043774668 0.002775658
## 5 0.007176175 5 0.031036957 0.031754575 0.002370328
## 6 0.006458558 6 0.023860782 0.028525296 0.002248167
## 7 0.001973448 7 0.017402225 0.018299247 0.001804692
## 8 0.001674441 9 0.013455328 0.013634733 0.001559380
## 9 0.001000000 12 0.008432006 0.008073197 0.001201373
##
## Variable importance
## Fuel.idx Cylinders Engine.volume Fuel.type Manufacturer
## 30 24 19 7 7
## Mileage Levy Model Drive.wheels Gear.box.type
## 3 2 2 2 2
## Prod.year Price
## 1 1
##
## Node number 1: 12853 observations, complexity param=0.6006459
## predicted class=MEDIUM expected loss=0.4336731 P(node) =1
## class counts: 1978 3596 7279
## probabilities: 0.154 0.280 0.566
## left son=2 (3724 obs) right son=3 (9129 obs)
## Primary splits:
## Fuel.idx < 0.795 to the left, improve=3869.8470, (0 missing)
## Cylinders < 4.5 to the right, improve=1952.6340, (0 missing)
## Engine.volume < 2.55 to the right, improve=1609.0710, (0 missing)
## Fuel.type < 4.5 to the left, improve=1553.7830, (0 missing)
## Manufacturer < 50.5 to the right, improve= 498.9054, (0 missing)
## Surrogate splits:
## Fuel.type < 4.5 to the left, agree=0.777, adj=0.230, (0 split)
## Manufacturer < 50.5 to the right, agree=0.738, adj=0.095, (0 split)
## Mileage < 310032 to the right, agree=0.735, adj=0.086, (0 split)
## Gear.box.type < 3.5 to the right, agree=0.723, adj=0.044, (0 split)
##
## Node number 2: 3724 observations, complexity param=0.01166128
## predicted class=LOW expected loss=0.05075188 P(node) =0.2897378
## class counts: 2 3535 187
## probabilities: 0.001 0.949 0.050
## left son=4 (3415 obs) right son=5 (309 obs)
## Primary splits:
## Prod.year < 2006.5 to the right, improve=124.66870, (0 missing)
## Fuel.type < 2 to the right, improve= 72.22757, (0 missing)
## Cylinders < 4.5 to the left, improve= 70.35871, (0 missing)
## Fuel.idx < 0.435 to the left, improve= 60.80848, (0 missing)
## Engine.volume < 2.55 to the left, improve= 56.22645, (0 missing)
## Surrogate splits:
## Fuel.type < 2 to the right, agree=0.963, adj=0.553, (0 split)
## Drive.wheels < 2.5 to the left, agree=0.924, adj=0.087, (0 split)
## Manufacturer < 5.5 to the right, agree=0.923, adj=0.071, (0 split)
## Model < 1152 to the left, agree=0.921, adj=0.049, (0 split)
## Mileage < 2564160 to the left, agree=0.918, adj=0.006, (0 split)
##
## Node number 3: 9129 observations, complexity param=0.3354862
## predicted class=MEDIUM expected loss=0.2231351 P(node) =0.7102622
## class counts: 1976 61 7092
## probabilities: 0.216 0.007 0.777
## left son=6 (1876 obs) right son=7 (7253 obs)
## Primary splits:
## Cylinders < 4.5 to the right, improve=2863.0490, (0 missing)
## Engine.volume < 2.55 to the right, improve=2331.7610, (0 missing)
## Drive.wheels < 1.5 to the left, improve= 486.8362, (0 missing)
## Levy < 913 to the right, improve= 484.9521, (0 missing)
## Manufacturer < 4.5 to the left, improve= 421.9788, (0 missing)
## Surrogate splits:
## Engine.volume < 2.75 to the right, agree=0.948, adj=0.746, (0 split)
## Manufacturer < 4.5 to the left, agree=0.829, adj=0.169, (0 split)
## Levy < 1381 to the right, agree=0.814, adj=0.093, (0 split)
## Model < 1172.5 to the right, agree=0.812, adj=0.087, (0 split)
## Drive.wheels < 1.5 to the left, agree=0.810, adj=0.076, (0 split)
##
## Node number 4: 3415 observations, complexity param=0.001674441
## predicted class=LOW expected loss=0.01171303 P(node) =0.2656967
## class counts: 1 3375 39
## probabilities: 0.000 0.988 0.011
## left son=8 (3399 obs) right son=9 (16 obs)
## Primary splits:
## Fuel.type < 2 to the right, improve=4.246581, (0 missing)
## Doors < 1.5 to the right, improve=4.194169, (0 missing)
## Fuel.idx < 0.435 to the left, improve=3.704599, (0 missing)
## Price < 34340.5 to the left, improve=2.239090, (0 missing)
## Cylinders < 6.5 to the left, improve=1.847703, (0 missing)
##
## Node number 5: 309 observations, complexity param=0.01166128
## predicted class=LOW expected loss=0.4822006 P(node) =0.02404108
## class counts: 1 160 148
## probabilities: 0.003 0.518 0.479
## left son=10 (174 obs) right son=11 (135 obs)
## Primary splits:
## Cylinders < 4.5 to the left, improve=120.30840, (0 missing)
## Engine.volume < 2.45 to the left, improve=101.43390, (0 missing)
## Price < 11552 to the left, improve= 29.50067, (0 missing)
## Leather.interior < 1.5 to the left, improve= 24.67542, (0 missing)
## Manufacturer < 37.5 to the right, improve= 22.60961, (0 missing)
## Surrogate splits:
## Engine.volume < 2.45 to the left, agree=0.926, adj=0.830, (0 split)
## Leather.interior < 1.5 to the left, agree=0.715, adj=0.348, (0 split)
## Price < 10819.5 to the left, agree=0.712, adj=0.341, (0 split)
## Airbags < 6.5 to the left, agree=0.683, adj=0.274, (0 split)
## Gear.box.type < 2.5 to the left, agree=0.670, adj=0.244, (0 split)
##
## Node number 6: 1876 observations
## predicted class=HIGH expected loss=0.001599147 P(node) =0.1459581
## class counts: 1873 0 3
## probabilities: 0.998 0.000 0.002
##
## Node number 7: 7253 observations, complexity param=0.009508432
## predicted class=MEDIUM expected loss=0.02261133 P(node) =0.5643041
## class counts: 103 61 7089
## probabilities: 0.014 0.008 0.977
## left son=14 (153 obs) right son=15 (7100 obs)
## Primary splits:
## Engine.volume < 2.6 to the right, improve=134.043900, (0 missing)
## Cylinders < 3.5 to the left, improve= 81.845530, (0 missing)
## Levy < 1097 to the right, improve= 6.559693, (0 missing)
## Model < 1143 to the right, improve= 2.736578, (0 missing)
## Prod.year < 2011.5 to the left, improve= 2.482419, (0 missing)
##
## Node number 8: 3399 observations, complexity param=0.001674441
## predicted class=LOW expected loss=0.01000294 P(node) =0.2644519
## class counts: 1 3365 33
## probabilities: 0.000 0.990 0.010
## left son=16 (2837 obs) right son=17 (562 obs)
## Primary splits:
## Fuel.idx < 0.46 to the left, improve=2.6587290, (0 missing)
## Fuel.type < 3.5 to the left, improve=2.3201490, (0 missing)
## Cylinders < 6.5 to the left, improve=1.8708180, (0 missing)
## Engine.volume < 2.35 to the left, improve=0.8907350, (0 missing)
## Drive.wheels < 2.5 to the left, improve=0.8886964, (0 missing)
## Surrogate splits:
## Fuel.type < 3.5 to the left, agree=0.981, adj=0.886, (0 split)
## Model < 999 to the left, agree=0.889, adj=0.327, (0 split)
## Mileage < 437213.5 to the left, agree=0.869, adj=0.210, (0 split)
## Engine.volume < 1.05 to the right, agree=0.835, adj=0.004, (0 split)
##
## Node number 9: 16 observations
## predicted class=LOW expected loss=0.375 P(node) =0.001244846
## class counts: 0 10 6
## probabilities: 0.000 0.625 0.375
##
## Node number 10: 174 observations
## predicted class=LOW expected loss=0.09195402 P(node) =0.0135377
## class counts: 0 158 16
## probabilities: 0.000 0.908 0.092
##
## Node number 11: 135 observations
## predicted class=MEDIUM expected loss=0.02222222 P(node) =0.01050338
## class counts: 1 2 132
## probabilities: 0.007 0.015 0.978
##
## Node number 14: 153 observations, complexity param=0.007176175
## predicted class=HIGH expected loss=0.3267974 P(node) =0.01190384
## class counts: 103 0 50
## probabilities: 0.673 0.000 0.327
## left son=28 (103 obs) right son=29 (50 obs)
## Primary splits:
## Manufacturer < 50 to the left, improve=48.80570, (0 missing)
## Airbags < 7 to the left, improve=36.00009, (0 missing)
## Model < 1047.5 to the left, improve=29.02608, (0 missing)
## Price < 4312.5 to the right, improve=26.12838, (0 missing)
## Engine.volume < 2.75 to the right, improve=23.62806, (0 missing)
## Surrogate splits:
## Price < 4312.5 to the right, agree=0.863, adj=0.58, (0 split)
## Airbags < 7 to the left, agree=0.863, adj=0.58, (0 split)
## Model < 1047.5 to the left, agree=0.830, adj=0.48, (0 split)
## Levy < 849 to the right, agree=0.797, adj=0.38, (0 split)
## Drive.wheels < 2.5 to the left, agree=0.797, adj=0.38, (0 split)
##
## Node number 15: 7100 observations, complexity param=0.006458558
## predicted class=MEDIUM expected loss=0.008591549 P(node) =0.5524002
## class counts: 0 61 7039
## probabilities: 0.000 0.009 0.991
## left son=30 (86 obs) right son=31 (7014 obs)
## Primary splits:
## Cylinders < 3.5 to the left, improve=85.486710, (0 missing)
## Engine.volume < 1.05 to the left, improve=28.243920, (0 missing)
## Model < 1143 to the right, improve= 2.163930, (0 missing)
## Wheel < 1.5 to the right, improve= 1.390551, (0 missing)
## Leather.interior < 1.5 to the left, improve= 1.205498, (0 missing)
## Surrogate splits:
## Engine.volume < 0.9 to the left, agree=0.988, adj=0.023, (0 split)
##
## Node number 16: 2837 observations
## predicted class=LOW expected loss=0.001057455 P(node) =0.2207267
## class counts: 0 2834 3
## probabilities: 0.000 0.999 0.001
##
## Node number 17: 562 observations, complexity param=0.001674441
## predicted class=LOW expected loss=0.05516014 P(node) =0.0437252
## class counts: 1 531 30
## probabilities: 0.002 0.945 0.053
## left son=34 (534 obs) right son=35 (28 obs)
## Primary splits:
## Engine.volume < 2.2 to the left, improve=52.71305, (0 missing)
## Airbags < 5 to the left, improve=17.11212, (0 missing)
## Model < 821.5 to the right, improve=15.64918, (0 missing)
## Levy < 225 to the right, improve=10.96933, (0 missing)
## Color < 7 to the right, improve=10.96586, (0 missing)
## Surrogate splits:
## Cylinders < 4.5 to the left, agree=0.959, adj=0.179, (0 split)
## Category < 10.5 to the left, agree=0.957, adj=0.143, (0 split)
## Manufacturer < 52.5 to the left, agree=0.956, adj=0.107, (0 split)
## Model < 307 to the right, agree=0.956, adj=0.107, (0 split)
## Price < 37500 to the left, agree=0.954, adj=0.071, (0 split)
##
## Node number 28: 103 observations
## predicted class=HIGH expected loss=0.04854369 P(node) =0.008013693
## class counts: 98 0 5
## probabilities: 0.951 0.000 0.049
##
## Node number 29: 50 observations
## predicted class=MEDIUM expected loss=0.1 P(node) =0.003890142
## class counts: 5 0 45
## probabilities: 0.100 0.000 0.900
##
## Node number 30: 86 observations, complexity param=0.001973448
## predicted class=LOW expected loss=0.2906977 P(node) =0.006691045
## class counts: 0 61 25
## probabilities: 0.000 0.709 0.291
## left son=60 (37 obs) right son=61 (49 obs)
## Primary splits:
## Engine.volume < 1.1 to the left, improve=10.975320, (0 missing)
## Manufacturer < 48 to the right, improve= 5.306386, (0 missing)
## Model < 832.5 to the right, improve= 3.994005, (0 missing)
## Cylinders < 2.5 to the left, improve= 2.797629, (0 missing)
## Levy < 643.5 to the left, improve= 2.526100, (0 missing)
## Surrogate splits:
## Model < 859 to the right, agree=0.791, adj=0.514, (0 split)
## Manufacturer < 49.5 to the right, agree=0.756, adj=0.432, (0 split)
## Cylinders < 2.5 to the right, agree=0.733, adj=0.378, (0 split)
## Prod.year < 2014.5 to the right, agree=0.663, adj=0.216, (0 split)
## Levy < 422.5 to the right, agree=0.640, adj=0.162, (0 split)
##
## Node number 31: 7014 observations
## predicted class=MEDIUM expected loss=0 P(node) =0.5457092
## class counts: 0 0 7014
## probabilities: 0.000 0.000 1.000
##
## Node number 34: 534 observations
## predicted class=LOW expected loss=0.005617978 P(node) =0.04154672
## class counts: 1 531 2
## probabilities: 0.002 0.994 0.004
##
## Node number 35: 28 observations
## predicted class=MEDIUM expected loss=0 P(node) =0.00217848
## class counts: 0 0 28
## probabilities: 0.000 0.000 1.000
##
## Node number 60: 37 observations
## predicted class=LOW expected loss=0 P(node) =0.002878705
## class counts: 0 37 0
## probabilities: 0.000 1.000 0.000
##
## Node number 61: 49 observations, complexity param=0.001973448
## predicted class=MEDIUM expected loss=0.4897959 P(node) =0.00381234
## class counts: 0 24 25
## probabilities: 0.000 0.490 0.510
## left son=122 (27 obs) right son=123 (22 obs)
## Primary splits:
## Cylinders < 2.5 to the left, improve=19.156460, (0 missing)
## Engine.volume < 1.25 to the right, improve= 8.675402, (0 missing)
## Manufacturer < 29 to the left, improve= 7.546939, (0 missing)
## Model < 737.5 to the left, improve= 7.432940, (0 missing)
## Prod.year < 2011 to the left, improve= 6.837164, (0 missing)
## Surrogate splits:
## Engine.volume < 1.25 to the right, agree=0.837, adj=0.636, (0 split)
## Prod.year < 2008.5 to the left, agree=0.816, adj=0.591, (0 split)
## Levy < 253 to the left, agree=0.796, adj=0.545, (0 split)
## Manufacturer < 34 to the left, agree=0.776, adj=0.500, (0 split)
## Mileage < 130500 to the right, agree=0.735, adj=0.409, (0 split)
##
## Node number 122: 27 observations
## predicted class=LOW expected loss=0.1111111 P(node) =0.002100677
## class counts: 0 24 3
## probabilities: 0.000 0.889 0.111
##
## Node number 123: 22 observations
## predicted class=MEDIUM expected loss=0 P(node) =0.001711663
## class counts: 0 0 22
## probabilities: 0.000 0.000 1.000
rpart.plot(dtFit, digits=3)
In this tree we have more nodes (the tree is now fully grown), because we have set the complexity parameter differently. We still start at the root node (the initial split) and follow the path dictated by a car's features to reach its predicted pollution class (HIGH, MEDIUM, or LOW). The branches link the nodes together and are labelled with the corresponding splitting variables. The decision tree is a very useful tool for making simple, interpretable predictions on our data set: it supports a deeper analysis and yields estimates that an exploratory analysis alone cannot provide.
Prediction
dtPred <- predict(dtFit, test, type = "class")
dtProb <- predict(dtFit, test, type = "prob")
threshold = 0.2
dtPred = rep("HIGH", nrow(test))
dtPred[which(dtProb[,2] > threshold)] = "LOW"
dtPred[which(dtProb[,3] > threshold)] = "MEDIUM"
confusionMatrix(factor(dtPred), test$Pollution)## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW MEDIUM
## HIGH 494 0 2
## LOW 0 896 5
## MEDIUM 0 2 1812
##
## Overall Statistics
##
## Accuracy : 0.9972
## 95% CI : (0.9947, 0.9987)
## No Information Rate : 0.5665
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9951
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: HIGH Class: LOW Class: MEDIUM
## Sensitivity 1.0000 0.9978 0.9962
## Specificity 0.9993 0.9978 0.9986
## Pos Pred Value 0.9960 0.9945 0.9989
## Neg Pred Value 1.0000 0.9991 0.9950
## Prevalence 0.1538 0.2797 0.5665
## Detection Rate 0.1538 0.2790 0.5643
## Detection Prevalence 0.1545 0.2806 0.5649
## Balanced Accuracy 0.9996 0.9978 0.9974
Regarding accuracy, we obtain a value very similar to those of the previous models, which we have seen is high for this data set. The confusion matrix lets us evaluate classification performance in more detail: it counts, for each true class, how many observations were assigned to each predicted class, so the misclassifications are visible per class.
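As a sanity check, the headline statistics reported by `confusionMatrix()` can be reproduced by hand with base R from the table above (rows are predictions, columns the reference):

```r
# Confusion matrix from the decision-tree predictions above
# (matrix() fills column-major, so values are listed column by column)
cm <- matrix(c(494, 0, 0,      # reference HIGH
               0, 896, 2,      # reference LOW
               2, 5, 1812),    # reference MEDIUM
             nrow = 3,
             dimnames = list(Prediction = c("HIGH", "LOW", "MEDIUM"),
                             Reference  = c("HIGH", "LOW", "MEDIUM")))

# overall accuracy: correctly classified / total
accuracy <- sum(diag(cm)) / sum(cm)

# per-class sensitivity (recall): diagonal / column totals
sens <- diag(cm) / colSums(cm)

round(accuracy, 4)   # matches the 0.9972 reported above
round(sens, 4)       # matches the per-class sensitivities above
```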
With Caret
caret.fit <- train(model,
data = train,
method = "rpart",
control=rpart.control(minsplit = 40, maxdepth = 12),
trControl = trainControl(method = "cv", number = 5),
tuneLength=10)
caret.fit## CART
##
## 12853 samples
## 20 predictor
## 3 classes: 'HIGH', 'LOW', 'MEDIUM'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10282, 10282, 10282, 10282, 10284
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0003588088 0.9956426 0.9924582
## 0.0011661285 0.9958762 0.9928679
## 0.0016744409 0.9947088 0.9908569
## 0.0019734482 0.9939309 0.9895153
## 0.0064585576 0.9898848 0.9824924
## 0.0071761751 0.9877839 0.9788796
## 0.0095084320 0.9853724 0.9746656
## 0.0116612845 0.9760353 0.9584628
## 0.3354861859 0.8845307 0.7784968
## 0.6006458558 0.7202366 0.3944380
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001166128.
rpart.plot(caret.fit$finalModel)
The same fit can be obtained with caret. As we see, resampling was done with 5-fold cross-validation (as set in the trControl argument). Accuracy across the tuning grid ranges from about 72% to 99.6%. The model selected was, of course, the one with the largest accuracy, at a complexity parameter (cp) of 0.001166128. The accuracy is still very high (good).
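Caret's choice can be reproduced from the resampling table: with the default selection rule it simply picks the cp value with the highest mean cross-validated accuracy. A sketch using a subset of the rows above:

```r
# Subset of caret's resampling results from the output above
cv <- data.frame(
  cp       = c(0.0003588088, 0.0011661285, 0.0016744409,
               0.0064585576, 0.6006458558),
  Accuracy = c(0.9956426, 0.9958762, 0.9947088,
               0.9898848, 0.7202366))

# "best" selection: the cp with the largest mean accuracy
best_cp <- cv$cp[which.max(cv$Accuracy)]
best_cp   # the same cp caret reports as final
```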
dtProb <- predict(caret.fit, test, type = "prob")
threshold = 0.2
dtPred = rep("HIGH", nrow(test))
dtPred[which(dtProb[,2] > threshold)] = "LOW"
dtPred[which(dtProb[,3] > threshold)] = "MEDIUM"
confusionMatrix(factor(dtPred), test$Pollution)$table## Reference
## Prediction HIGH LOW MEDIUM
## HIGH 494 0 2
## LOW 0 896 5
## MEDIUM 0 2 1812
CM = confusionMatrix(factor(dtPred), test$Pollution)$table
Random Forest
rf.train <- randomForest(Pollution ~., data=train,ntree=200
,mtry=10,importance=TRUE, do.trace=T, cutoff=c(0.3, 0.2,0.3))## ntree OOB 1 2 3
## 1: 0.45% 0.27% 0.31% 0.56%
## 2: 0.42% 0.41% 0.24% 0.51%
## 3: 0.30% 0.33% 0.11% 0.39%
## 4: 0.35% 0.42% 0.20% 0.41%
## 5: 0.33% 0.34% 0.18% 0.40%
## 6: 0.37% 0.38% 0.21% 0.46%
## 7: 0.32% 0.37% 0.06% 0.43%
## 8: 0.26% 0.21% 0.06% 0.37%
## 9: 0.30% 0.26% 0.08% 0.42%
## 10: 0.28% 0.20% 0.08% 0.40%
## 11: 0.25% 0.20% 0.06% 0.36%
## 12: 0.23% 0.15% 0.03% 0.36%
## 13: 0.23% 0.15% 0.03% 0.34%
## 14: 0.20% 0.15% 0.03% 0.30%
## 15: 0.22% 0.15% 0.03% 0.33%
## 16: 0.19% 0.15% 0.03% 0.29%
## 17: 0.19% 0.15% 0.03% 0.27%
## 18: 0.19% 0.20% 0.03% 0.27%
## 19: 0.19% 0.20% 0.03% 0.27%
## 20: 0.20% 0.15% 0.03% 0.30%
## 21: 0.19% 0.15% 0.03% 0.27%
## 22: 0.19% 0.10% 0.03% 0.29%
## 23: 0.18% 0.10% 0.03% 0.27%
## 24: 0.19% 0.10% 0.03% 0.29%
## 25: 0.18% 0.10% 0.03% 0.27%
## 26: 0.18% 0.15% 0.03% 0.26%
## 27: 0.19% 0.15% 0.03% 0.29%
## 28: 0.19% 0.15% 0.03% 0.27%
## 29: 0.18% 0.15% 0.03% 0.26%
## 30: 0.17% 0.10% 0.03% 0.26%
## 31: 0.16% 0.10% 0.03% 0.25%
## 32: 0.16% 0.10% 0.03% 0.25%
## 33: 0.16% 0.15% 0.03% 0.23%
## 34: 0.15% 0.10% 0.03% 0.22%
## 35: 0.14% 0.10% 0.03% 0.21%
## 36: 0.14% 0.10% 0.03% 0.21%
## 37: 0.13% 0.10% 0.03% 0.19%
## 38: 0.16% 0.10% 0.03% 0.25%
## 39: 0.16% 0.10% 0.03% 0.23%
## 40: 0.16% 0.10% 0.03% 0.23%
## 41: 0.15% 0.10% 0.03% 0.22%
## 42: 0.17% 0.15% 0.03% 0.25%
## 43: 0.17% 0.15% 0.03% 0.25%
## 44: 0.16% 0.10% 0.03% 0.25%
## 45: 0.17% 0.10% 0.03% 0.26%
## 46: 0.17% 0.10% 0.03% 0.26%
## 47: 0.17% 0.10% 0.03% 0.26%
## 48: 0.18% 0.15% 0.03% 0.26%
## 49: 0.18% 0.15% 0.03% 0.26%
## 50: 0.16% 0.10% 0.03% 0.25%
## 51: 0.16% 0.10% 0.03% 0.25%
## 52: 0.17% 0.15% 0.03% 0.25%
## 53: 0.17% 0.10% 0.03% 0.26%
## 54: 0.16% 0.10% 0.03% 0.25%
## 55: 0.17% 0.10% 0.03% 0.26%
## 56: 0.16% 0.10% 0.03% 0.25%
## 57: 0.16% 0.15% 0.03% 0.23%
## 58: 0.15% 0.10% 0.03% 0.22%
## 59: 0.16% 0.10% 0.03% 0.23%
## 60: 0.16% 0.10% 0.03% 0.23%
## 61: 0.16% 0.10% 0.03% 0.25%
## 62: 0.16% 0.10% 0.03% 0.23%
## 63: 0.15% 0.10% 0.03% 0.22%
## 64: 0.15% 0.10% 0.03% 0.22%
## 65: 0.15% 0.10% 0.03% 0.22%
## 66: 0.15% 0.10% 0.03% 0.22%
## 67: 0.16% 0.15% 0.03% 0.23%
## 68: 0.16% 0.15% 0.03% 0.23%
## 69: 0.15% 0.10% 0.03% 0.22%
## 70: 0.15% 0.10% 0.03% 0.22%
## 71: 0.15% 0.10% 0.03% 0.22%
## 72: 0.16% 0.10% 0.03% 0.23%
## 73: 0.14% 0.10% 0.03% 0.21%
## 74: 0.16% 0.10% 0.03% 0.23%
## 75: 0.15% 0.10% 0.03% 0.22%
## 76: 0.14% 0.10% 0.03% 0.21%
## 77: 0.14% 0.10% 0.03% 0.21%
## 78: 0.14% 0.10% 0.03% 0.21%
## 79: 0.14% 0.10% 0.03% 0.21%
## 80: 0.14% 0.10% 0.03% 0.21%
## 81: 0.14% 0.10% 0.03% 0.21%
## 82: 0.15% 0.10% 0.03% 0.22%
## 83: 0.15% 0.10% 0.03% 0.22%
## 84: 0.15% 0.10% 0.03% 0.22%
## 85: 0.14% 0.10% 0.03% 0.21%
## 86: 0.15% 0.10% 0.03% 0.22%
## 87: 0.15% 0.10% 0.03% 0.22%
## 88: 0.15% 0.10% 0.03% 0.22%
## 89: 0.15% 0.10% 0.03% 0.22%
## 90: 0.15% 0.10% 0.03% 0.22%
## 91: 0.16% 0.10% 0.03% 0.23%
## 92: 0.15% 0.10% 0.03% 0.22%
## 93: 0.15% 0.10% 0.03% 0.22%
## 94: 0.16% 0.10% 0.03% 0.23%
## 95: 0.15% 0.10% 0.03% 0.22%
## 96: 0.15% 0.10% 0.03% 0.22%
## 97: 0.14% 0.10% 0.03% 0.21%
## 98: 0.14% 0.10% 0.03% 0.21%
## 99: 0.14% 0.10% 0.03% 0.21%
## 100: 0.14% 0.10% 0.03% 0.21%
## 101: 0.15% 0.10% 0.03% 0.22%
## 102: 0.15% 0.10% 0.03% 0.22%
## 103: 0.15% 0.10% 0.03% 0.22%
## 104: 0.16% 0.10% 0.03% 0.23%
## 105: 0.16% 0.10% 0.03% 0.23%
## 106: 0.16% 0.10% 0.03% 0.23%
## 107: 0.16% 0.10% 0.03% 0.23%
## 108: 0.16% 0.10% 0.03% 0.25%
## 109: 0.16% 0.10% 0.03% 0.25%
## 110: 0.16% 0.10% 0.03% 0.25%
## 111: 0.16% 0.10% 0.03% 0.25%
## 112: 0.17% 0.15% 0.03% 0.25%
## 113: 0.17% 0.15% 0.03% 0.25%
## 114: 0.17% 0.15% 0.03% 0.25%
## 115: 0.16% 0.10% 0.03% 0.25%
## 116: 0.16% 0.10% 0.03% 0.25%
## 117: 0.17% 0.15% 0.03% 0.25%
## 118: 0.16% 0.15% 0.03% 0.23%
## 119: 0.16% 0.15% 0.03% 0.23%
## 120: 0.16% 0.10% 0.03% 0.23%
## 121: 0.16% 0.10% 0.03% 0.23%
## 122: 0.16% 0.10% 0.03% 0.23%
## 123: 0.16% 0.15% 0.03% 0.23%
## 124: 0.15% 0.10% 0.03% 0.22%
## 125: 0.15% 0.10% 0.03% 0.22%
## 126: 0.15% 0.10% 0.03% 0.22%
## 127: 0.15% 0.10% 0.03% 0.22%
## 128: 0.15% 0.10% 0.03% 0.22%
## 129: 0.14% 0.10% 0.03% 0.21%
## 130: 0.15% 0.15% 0.03% 0.21%
## 131: 0.14% 0.10% 0.03% 0.21%
## 132: 0.15% 0.15% 0.03% 0.21%
## 133: 0.15% 0.15% 0.03% 0.21%
## 134: 0.15% 0.15% 0.03% 0.21%
## 135: 0.16% 0.15% 0.03% 0.22%
## 136: 0.16% 0.15% 0.03% 0.22%
## 137: 0.16% 0.15% 0.03% 0.22%
## 138: 0.16% 0.15% 0.03% 0.22%
## 139: 0.16% 0.15% 0.03% 0.22%
## 140: 0.16% 0.15% 0.03% 0.22%
## 141: 0.15% 0.15% 0.03% 0.21%
## 142: 0.15% 0.15% 0.03% 0.21%
## 143: 0.15% 0.15% 0.03% 0.21%
## 144: 0.16% 0.15% 0.03% 0.22%
## 145: 0.15% 0.10% 0.03% 0.22%
## 146: 0.15% 0.10% 0.03% 0.22%
## 147: 0.15% 0.10% 0.03% 0.22%
## 148: 0.15% 0.10% 0.03% 0.22%
## 149: 0.15% 0.10% 0.03% 0.22%
## 150: 0.14% 0.10% 0.03% 0.21%
## 151: 0.14% 0.10% 0.03% 0.21%
## 152: 0.14% 0.10% 0.03% 0.21%
## 153: 0.16% 0.15% 0.03% 0.22%
## 154: 0.16% 0.15% 0.03% 0.22%
## 155: 0.16% 0.15% 0.03% 0.22%
## 156: 0.16% 0.15% 0.03% 0.22%
## 157: 0.16% 0.15% 0.03% 0.22%
## 158: 0.16% 0.15% 0.03% 0.22%
## 159: 0.16% 0.15% 0.03% 0.22%
## 160: 0.16% 0.15% 0.03% 0.22%
## 161: 0.16% 0.15% 0.03% 0.22%
## 162: 0.16% 0.15% 0.03% 0.22%
## 163: 0.16% 0.15% 0.03% 0.22%
## 164: 0.16% 0.15% 0.03% 0.22%
## 165: 0.16% 0.15% 0.03% 0.22%
## 166: 0.16% 0.15% 0.03% 0.22%
## 167: 0.16% 0.15% 0.03% 0.22%
## 168: 0.16% 0.15% 0.03% 0.22%
## 169: 0.16% 0.15% 0.03% 0.22%
## 170: 0.16% 0.15% 0.03% 0.22%
## 171: 0.16% 0.15% 0.03% 0.22%
## 172: 0.16% 0.15% 0.03% 0.22%
## 173: 0.16% 0.15% 0.03% 0.22%
## 174: 0.16% 0.15% 0.03% 0.22%
## 175: 0.16% 0.15% 0.03% 0.22%
## 176: 0.16% 0.15% 0.03% 0.22%
## 177: 0.15% 0.10% 0.03% 0.22%
## 178: 0.15% 0.10% 0.03% 0.22%
## 179: 0.15% 0.10% 0.03% 0.22%
## 180: 0.14% 0.10% 0.03% 0.21%
## 181: 0.14% 0.10% 0.03% 0.21%
## 182: 0.14% 0.10% 0.03% 0.21%
## 183: 0.15% 0.10% 0.03% 0.22%
## 184: 0.15% 0.10% 0.03% 0.22%
## 185: 0.15% 0.10% 0.03% 0.22%
## 186: 0.15% 0.10% 0.03% 0.22%
## 187: 0.15% 0.10% 0.03% 0.22%
## 188: 0.15% 0.10% 0.03% 0.22%
## 189: 0.15% 0.10% 0.03% 0.22%
## 190: 0.15% 0.10% 0.03% 0.22%
## 191: 0.15% 0.10% 0.03% 0.22%
## 192: 0.15% 0.10% 0.03% 0.22%
## 193: 0.15% 0.10% 0.03% 0.22%
## 194: 0.15% 0.10% 0.03% 0.22%
## 195: 0.15% 0.10% 0.03% 0.22%
## 196: 0.15% 0.10% 0.03% 0.22%
## 197: 0.15% 0.10% 0.03% 0.22%
## 198: 0.15% 0.10% 0.03% 0.22%
## 199: 0.15% 0.10% 0.03% 0.22%
## 200: 0.15% 0.10% 0.03% 0.22%
Variable Importance
rf_imp <- varImp(rf.train)
plot(rf_imp, scales = list(y = list(cex = .95)))
# mtry: number of variables randomly sampled as candidates at each split
# ntree: number of trees to grow
# cutoff: cutoff probabilities in majority vote
Prediction
rf.pred <- predict(rf.train, newdata=test)
confusionMatrix(rf.pred, test$Pollution)## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW MEDIUM
## HIGH 494 0 1
## LOW 0 898 1
## MEDIUM 0 0 1817
##
## Overall Statistics
##
## Accuracy : 0.9994
## 95% CI : (0.9978, 0.9999)
## No Information Rate : 0.5665
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9989
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: HIGH Class: LOW Class: MEDIUM
## Sensitivity 1.0000 1.0000 0.9989
## Specificity 0.9996 0.9996 1.0000
## Pos Pred Value 0.9980 0.9989 1.0000
## Neg Pred Value 1.0000 1.0000 0.9986
## Prevalence 0.1538 0.2797 0.5665
## Detection Rate 0.1538 0.2797 0.5659
## Detection Prevalence 0.1542 0.2800 0.5659
## Balanced Accuracy 0.9998 0.9998 0.9995
rfPred = predict(rf.train, newdata=test)
CM = confusionMatrix(factor(rfPred), test$Pollution)$table
threshold = 0.2
rfProb = predict(rf.train, newdata=test, type="prob")
rfPred = rep("HIGH", nrow(test))
rfPred[which(rfProb[,2] > threshold)] = "LOW"
rfPred[which(rfProb[,3] > threshold)] = "MEDIUM"
confusionMatrix(factor(rfPred), test$Pollution)$table## Reference
## Prediction HIGH LOW MEDIUM
## HIGH 491 0 0
## LOW 0 891 1
## MEDIUM 3 7 1818
Random forest builds many classification trees on bootstrap samples of the data, each tree considering a random subset of predictors at every split, rather than searching for one single best tree. This diversity among trees reduces variance, so the aggregated (majority-vote) prediction is much more stable.
Furthermore, this algorithm stands out for being powerful and highly accurate, and for handling many predictors at once with no need for normalization. It also reduces the risk of overfitting, offers flexibility (it handles both regression and classification tasks with a high degree of accuracy), and makes it easy to evaluate variable importance, i.e. each variable's contribution to the model.
The random forest accuracy we obtained is again very high, which confirms the benefits above: it is higher than that of the single decision tree (although not by much, as both accuracies are already high due to the way this variable has been created). We grew 200 trees, with 10 variables tried at each split. The kappa statistic (0.9989) is likewise excellent.
On the whole, between decision trees and random forests, random forest is the better choice when the goal is predictive accuracy. Even though it takes longer to train (and scales worse to high-dimensional data), it returns better predictions on held-out data.
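The bagging-plus-voting idea behind random forests can be sketched in base R. This is illustrative only, with simulated toy data and a crude one-variable threshold rule standing in for each tree (randomForest() additionally subsamples variables at every split):

```r
# Toy sketch of bagging: fit B weak "stumps" on bootstrap samples,
# then aggregate their predictions by majority vote.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- ifelse(x + rnorm(n, sd = 0.5) > 0, "HIGH", "LOW")  # noisy true labels

B <- 100
votes <- sapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)        # bootstrap sample of the rows
  thr <- median(x[idx])                   # this stump's learned threshold
  ifelse(x > thr, "HIGH", "LOW")          # its vote for every observation
})

# majority vote across the B stumps, one prediction per observation
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == y)                           # bagged accuracy on the toy data
```

Each individual stump is a weak, high-variance classifier; averaging many of them over different bootstrap samples is what stabilizes the final prediction.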
Advanced Regression
Preparing data
In this part, we will be predicting the price of a car.
# DATA PARTITION
data2 = data # data where we will convert all variables to numeric
# first character to factor, then factor to numeric
data2$Manufacturer = as.factor(data2$Manufacturer)
data2$Manufacturer = as.numeric(data2$Manufacturer)
data2$Model = as.factor(data2$Model)
data2$Model = as.numeric(data2$Model)
data2$Category = as.factor(data2$Category)
data2$Category = as.numeric(data2$Category)
data2$Leather.interior = as.factor(data2$Leather.interior)
data2$Leather.interior = as.numeric(data2$Leather.interior)
data2$Fuel.type = as.factor(data2$Fuel.type)
data2$Fuel.type = as.numeric(data2$Fuel.type)
data2$Gear.box.type = as.factor(data2$Gear.box.type)
data2$Gear.box.type = as.numeric(data2$Gear.box.type)
data2$Drive.wheels = as.factor(data2$Drive.wheels)
data2$Drive.wheels = as.numeric(data2$Drive.wheels)
data2$Doors = as.factor(data2$Doors)
data2$Doors = as.numeric(data2$Doors)
data2$Wheel = as.factor(data2$Wheel)
data2$Wheel = as.numeric(data2$Wheel)
data2$Color = as.factor(data2$Color)
data2$Color = as.numeric(data2$Color)
data2$Pollution= as.factor(data2$Pollution)
data2$Pollution = as.numeric(data2$Pollution)
in_train <- createDataPartition(data2$Price, p = 0.75, list = FALSE) # 75% for training
training <- data2[in_train,]
testing <- data2[-in_train,]
nrow(training)## [1] 12050
nrow(testing)## [1] 4014
Linear Approaches
corr=cor(training)
First, we try a simple regression.
linFit <- lm(Price ~., data=training)
summary(linFit)##
## Call:
## lm(formula = Price ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34981 -6545 -583 5694 37060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.511e+06 5.321e+04 -47.187 < 2e-16 ***
## Levy -5.247e+00 2.589e-01 -20.267 < 2e-16 ***
## Manufacturer 2.501e+01 5.958e+00 4.197 2.73e-05 ***
## Model 7.651e-01 2.916e-01 2.624 0.00871 **
## Prod.year 1.249e+03 2.642e+01 47.268 < 2e-16 ***
## Category -3.308e+02 3.413e+01 -9.691 < 2e-16 ***
## Leather.interior -6.379e+02 2.360e+02 -2.703 0.00689 **
## Fuel.type -1.973e+03 7.740e+01 -25.487 < 2e-16 ***
## Engine.volume 3.061e+03 2.359e+02 12.976 < 2e-16 ***
## Mileage -8.380e-07 2.063e-06 -0.406 0.68460
## Cylinders 4.226e+02 1.633e+02 2.588 0.00965 **
## Gear.box.type 2.605e+03 1.082e+02 24.091 < 2e-16 ***
## Drive.wheels 1.372e+03 1.884e+02 7.283 3.47e-13 ***
## Doors 1.137e+03 4.104e+02 2.771 0.00560 **
## Wheel -2.450e+03 3.649e+02 -6.715 1.96e-11 ***
## Color 3.420e+01 1.651e+01 2.072 0.03831 *
## Airbags -4.660e+02 2.411e+01 -19.328 < 2e-16 ***
## New -4.403e+03 5.003e+02 -8.801 < 2e-16 ***
## Turbo 2.942e+03 3.212e+02 9.158 < 2e-16 ***
## Fuel.idx 1.033e+04 3.958e+02 26.098 < 2e-16 ***
## Pollution 2.073e+03 1.864e+02 11.125 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9502 on 12029 degrees of freedom
## Multiple R-squared: 0.3108, Adjusted R-squared: 0.3097
## F-statistic: 271.3 on 20 and 12029 DF, p-value: < 2.2e-16
We can see that the multiple R-squared is about 0.31. This means that the proportion of variation in the dependent variable explained by the independent variables is quite low. We will try different models later to increase it, in order to obtain a better measure of how well the observed outcomes are replicated; but we must keep in mind that the starting value is low.
It is also worth considering that, since this model already includes all available variables, a richer specification (for example, transformations or interactions of the predictors) may be needed to improve this value and the performance of the model in general (we will see later).
par(mfrow=c(2,2))
plot(linFit, pch=21 ,bg='blue1',cex=1)
These four diagnostic plots let us check the assumptions of the linear model and judge how trustworthy its predictions are.
The residuals-vs-fitted plot is used to detect non-linearity, unequal error variances, and outliers. We see a slight curve in this graph; however, given the large number of data points, linearity seems to hold reasonably well, as the red line stays close to the dashed line. Moving right along the x-axis, the spread of the residuals appears to increase, and there are many outliers (points with large residuals).
The normal Q-Q plot suggests that the residuals are approximately normally distributed, as the points fall roughly along the 45-degree line.
The scale-location plot shows the square root of the standardized residuals against the fitted values; a roughly horizontal red line indicates constant error variance (homoscedasticity).
The residuals-vs-leverage plot allows us to identify influential observations. Very few observations have high leverage, so only a few points could strongly influence the coefficients of the regression model; removing those observations would change the coefficients. Points falling outside Cook's distance (the red dashed lines) are considered influential. From the graph, we can see that we have no influential points.
predictions=predict(linFit,newdata = testing)
cor(predictions, testing$Prod.year)^2## [1] 0.2244665
par(mfrow=c(1,1)) # set the plotting layout back to a single panel
Right now this model is not very good. Note also that the R-squared above is computed against testing$Prod.year rather than the response; to evaluate the price predictions it should be cor(predictions, testing$Price)^2.
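The out-of-sample R-squared computation can be sketched on simulated toy data (hypothetical variables x and y, standing in for our predictors and Price): square the correlation between the predictions and the observed response on the held-out rows.

```r
# Toy sketch: fit on a training slice, score R-squared on the held-out slice
set.seed(42)
x <- rnorm(200)
y <- 2 * x + rnorm(200)                 # true relationship plus noise
d <- data.frame(x, y)

train_idx <- 1:150                       # first 150 rows for training
fit  <- lm(y ~ x, data = d[train_idx, ])
pred <- predict(fit, newdata = d[-train_idx, ])

# out-of-sample R-squared: squared correlation of predicted vs OBSERVED y
r2 <- cor(pred, d$y[-train_idx])^2
r2
```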
Multiple Regression
We will now see whether a multiple regression on a reduced set of predictors gives better results.
linFit <- lm(Price ~ Fuel.idx + Mileage + Price + Levy + Engine.volume + Manufacturer +
Category + Cylinders + Fuel.type +
Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New +
Turbo, data=training)
summary(linFit)##
## Call:
## lm(formula = Price ~ Fuel.idx + Mileage + Price + Levy + Engine.volume +
## Manufacturer + Category + Cylinders + Fuel.type + Gear.box.type +
## Drive.wheels + Wheel + Color + Airbags + New + Turbo, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27494 -8094 -1321 5696 38569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.399e+04 9.667e+02 24.812 < 2e-16 ***
## Fuel.idx 1.069e+04 3.811e+02 28.055 < 2e-16 ***
## Mileage -2.625e-06 2.261e-06 -1.161 0.245682
## Levy -9.422e-01 2.662e-01 -3.539 0.000403 ***
## Engine.volume 4.439e+02 2.441e+02 1.819 0.068983 .
## Manufacturer 7.997e+00 6.444e+00 1.241 0.214592
## Category -1.862e+02 3.615e+01 -5.150 2.64e-07 ***
## Cylinders -1.290e+03 1.511e+02 -8.534 < 2e-16 ***
## Fuel.type -1.569e+03 8.348e+01 -18.798 < 2e-16 ***
## Gear.box.type 2.073e+03 1.155e+02 17.945 < 2e-16 ***
## Drive.wheels 5.219e+02 2.006e+02 2.602 0.009271 **
## Wheel -7.564e+03 3.715e+02 -20.359 < 2e-16 ***
## Color 2.997e+01 1.807e+01 1.659 0.097235 .
## Airbags -2.644e+02 2.586e+01 -10.225 < 2e-16 ***
## New -5.715e+03 5.475e+02 -10.439 < 2e-16 ***
## Turbo 2.049e+03 3.497e+02 5.860 4.74e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10420 on 12034 degrees of freedom
## Multiple R-squared: 0.1707, Adjusted R-squared: 0.1697
## F-statistic: 165.2 on 15 and 12034 DF, p-value: < 2.2e-16
The R-squared is actually lower than before (0.17), since we have dropped some informative predictors (notably Prod.year); note also that Price appears among the regressors in the formula, but R drops the response from the predictor side, as the coefficient table confirms. These results tell us that the chosen independent variables explain little of the variance in the dependent variable for this sample.
Model Selection
exhaustive <- regsubsets(Price ~ Fuel.idx + Mileage + Prod.year + Levy + Engine.volume + Manufacturer
+ Model + Category + Cylinders + Leather.interior + Fuel.type +
Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New +
Turbo, data=training)
model = Price ~ Fuel.idx + Mileage + Prod.year + Levy + Engine.volume + Manufacturer+
Model + Category + Cylinders + Leather.interior + Fuel.type +
Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New +
Turbo
linFit <- lm(model, data=training)# ols_step_all_possible(linFit) # All possible subset regressions: the number is exponential with p
# ols_step_best_subset(linFit) # The best subset regression for each p: still exponential with p
ols_step_forward_p(linFit) # forward based on p-value##
## Selection Summary
## ------------------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------------------
## 1 Prod.year 0.0701 0.0700 4012.2757 258530.7151 11029.3046
## 2 Fuel.idx 0.1286 0.1284 3003.8949 257749.6311 10677.1311
## 3 Fuel.type 0.1770 0.1768 2170.2885 257063.2009 10376.8785
## 4 Gear.box.type 0.2331 0.2328 1203.5800 256214.5424 10017.4101
## 5 Levy 0.2635 0.2632 680.7769 255729.3186 9817.3315
## 6 Airbags 0.2781 0.2778 429.5998 255489.0345 9719.5331
## 7 Wheel 0.2824 0.2820 356.9923 255418.6977 9690.8057
## 8 New 0.2865 0.2860 288.9869 255352.4258 9663.7931
## 9 Turbo 0.2902 0.2896 227.8206 255292.4899 9639.3898
## 10 Category 0.2937 0.2931 169.2541 255234.8027 9615.9453
## 11 Engine.volume 0.2975 0.2968 105.2002 255171.3689 9590.2708
## 12 Drive.wheels 0.2994 0.2987 73.5029 255139.8505 9577.3397
## 13 Manufacturer 0.3013 0.3005 43.8226 255110.2533 9565.1885
## 14 Cylinders 0.3022 0.3014 29.3248 255095.7659 9559.0441
## 15 Leather.interior 0.3026 0.3018 24.1827 255090.6212 9556.6077
## 16 Model 0.3030 0.3021 20.2028 255086.6359 9554.6316
## 17 Color 0.3033 0.3023 17.0328 255083.4589 9552.9763
## ------------------------------------------------------------------------------------------
plot(ols_step_forward_p(linFit))
This function builds a regression model from a set of candidate predictors by entering variables based on p-values. In the plot we see the R-squared grow from 0.07 to 0.30 as variables are added (still very low for a model); the adjusted R-squared behaves similarly, since it merely penalizes the number of terms in the model. Mallows' C(p) is used to assess the fit of the estimated regression model, while AIC and SBC are information criteria used to compare the candidate models.
ols_step_forward_aic(linFit) # forward based on AIC##
## Selection Summary
## ------------------------------------------------------------------------------------------
## Variable AIC Sum Sq RSS R-Sq Adj. R-Sq
## ------------------------------------------------------------------------------------------
## Prod.year 258530.715 1.10433e+11 1.465586e+12 0.07007 0.06999
## Fuel.idx 257749.631 202647123410.469 1.373372e+12 0.12858 0.12844
## Fuel.type 257063.201 278910158867.316 1.297109e+12 0.17697 0.17677
## Gear.box.type 256214.542 3.67321e+11 1.208698e+12 0.23307 0.23281
## Levy 255729.319 4.15218e+11 1.160801e+12 0.26346 0.26315
## Airbags 255489.035 438324655978.113 1.137694e+12 0.27812 0.27776
## Wheel 255418.698 445133838559.480 1.130885e+12 0.28244 0.28202
## New 255352.426 4.51523e+11 1.124496e+12 0.28650 0.28602
## Turbo 255292.490 4.57288e+11 1.118731e+12 0.29015 0.28962
## Category 255234.803 462815677013.161 1.113203e+12 0.29366 0.29307
## Engine.volume 255171.369 468844199439.694 1.107175e+12 0.29749 0.29684
## Drive.wheels 255139.851 471919642250.367 1.104099e+12 0.29944 0.29874
## Manufacturer 255110.253 4.74811e+11 1.101208e+12 0.30127 0.30052
## Cylinders 255095.766 476316693602.173 1.099702e+12 0.30223 0.30142
## Leather.interior 255090.621 476968532587.510 1.09905e+12 0.30264 0.30177
## Model 255086.636 477514302487.423 1.098504e+12 0.30299 0.30206
## Color 255083.459 477986148124.548 1.098033e+12 0.30329 0.30230
## ------------------------------------------------------------------------------------------
plot(ols_step_forward_aic(linFit))
The plot shows forward selection based on AIC, where Prod.year is the variable most strongly related to the price. This makes sense, because prices change substantially over the years. The fuel information of a car and the gear box type also appear important for predicting the price.
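Base R offers the same idea through stats::step(). A sketch on simulated toy data (hypothetical variables x1–x3, not our car data): start from the intercept-only model and let step() add predictors as long as AIC improves.

```r
# Forward selection by AIC with stats::step() on simulated data
set.seed(7)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 3 * d$x1 + 1.5 * d$x2 + rnorm(n)   # x3 is pure noise

null_fit <- lm(y ~ 1, data = d)           # intercept-only starting model
fwd <- step(null_fit, scope = ~ x1 + x2 + x3,
            direction = "forward", trace = 0)

names(coef(fwd))   # the informative x1 and x2 are selected
```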
ols_step_backward_aic(linFit) # backward AIC##
##
## Backward Elimination Summary
## ------------------------------------------------------------------------------------
## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
## ------------------------------------------------------------------------------------
## Full Model 255085.426 1.09803e+12 477989141610.556 0.30329 0.30225
## Mileage 255083.459 1.098033e+12 477986148124.555 0.30329 0.30230
## ------------------------------------------------------------------------------------
plot(ols_step_backward_aic(linFit))
With this method, we look for the simplest model that still explains the data. Unlike the forward method, which adds variables, backward elimination removes them: here Mileage is the only variable dropped, indicating that it contributes least to explaining the price of the cars.
ols_step_both_aic(linFit) # stepwise AIC##
##
## Stepwise Summary
## ------------------------------------------------------------------------------------------------------
## Variable Method AIC RSS Sum Sq R-Sq Adj. R-Sq
## ------------------------------------------------------------------------------------------------------
## Prod.year addition 258530.715 1.465586e+12 1.10433e+11 0.07007 0.06999
## Fuel.idx addition 257749.631 1.373372e+12 202647123410.469 0.12858 0.12844
## Fuel.type addition 257063.201 1.297109e+12 278910158867.316 0.17697 0.17677
## Gear.box.type addition 256214.542 1.208698e+12 3.67321e+11 0.23307 0.23281
## Levy addition 255729.319 1.160801e+12 4.15218e+11 0.26346 0.26315
## Airbags addition 255489.035 1.137694e+12 438324655978.113 0.27812 0.27776
## Wheel addition 255418.698 1.130885e+12 445133838559.480 0.28244 0.28202
## New addition 255352.426 1.124496e+12 4.51523e+11 0.28650 0.28602
## Turbo addition 255292.490 1.118731e+12 4.57288e+11 0.29015 0.28962
## Category addition 255234.803 1.113203e+12 462815677013.161 0.29366 0.29307
## Engine.volume addition 255171.369 1.107175e+12 468844199439.694 0.29749 0.29684
## Drive.wheels addition 255139.851 1.104099e+12 471919642250.367 0.29944 0.29874
## Manufacturer addition 255110.253 1.101208e+12 4.74811e+11 0.30127 0.30052
## Cylinders addition 255095.766 1.099702e+12 476316693602.173 0.30223 0.30142
## Leather.interior addition 255090.621 1.09905e+12 476968532587.510 0.30264 0.30177
## Model addition 255086.636 1.098504e+12 477514302487.423 0.30299 0.30206
## Color addition 255083.459 1.098033e+12 477986148124.548 0.30329 0.30230
## ------------------------------------------------------------------------------------------------------
plot(ols_step_both_aic(linFit))
With the stepwise AIC method, we get similar results to those of the forward AIC method.
linFit <- lm(Price ~ Fuel.idx + Mileage + Prod.year + Levy + Engine.volume + Manufacturer+
Model + Category + Cylinders + Leather.interior + Fuel.type +
Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New +
Turbo, data=training)
summary(linFit)
##
## Call:
## lm(formula = Price ~ Fuel.idx + Mileage + Prod.year + Levy +
## Engine.volume + Manufacturer + Model + Category + Cylinders +
## Leather.interior + Fuel.type + Gear.box.type + Drive.wheels +
## Wheel + Color + Airbags + New + Turbo, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36378 -6554 -489 5642 36731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.501e+06 5.338e+04 -46.849 < 2e-16 ***
## Fuel.idx 1.236e+04 3.513e+02 35.177 < 2e-16 ***
## Mileage -3.755e-07 2.074e-06 -0.181 0.856288
## Prod.year 1.250e+03 2.647e+01 47.223 < 2e-16 ***
## Levy -5.173e+00 2.600e-01 -19.895 < 2e-16 ***
## Engine.volume 2.376e+03 2.294e+02 10.357 < 2e-16 ***
## Manufacturer 2.778e+01 5.984e+00 4.642 3.48e-06 ***
## Model 6.849e-01 2.927e-01 2.340 0.019315 *
## Category -3.136e+02 3.353e+01 -9.352 < 2e-16 ***
## Cylinders -5.419e+02 1.395e+02 -3.885 0.000103 ***
## Leather.interior -5.854e+02 2.371e+02 -2.469 0.013560 *
## Fuel.type -1.979e+03 7.774e+01 -25.460 < 2e-16 ***
## Gear.box.type 2.571e+03 1.087e+02 23.658 < 2e-16 ***
## Drive.wheels 1.142e+03 1.869e+02 6.112 1.01e-09 ***
## Wheel -2.507e+03 3.657e+02 -6.856 7.42e-12 ***
## Color 3.772e+01 1.659e+01 2.273 0.023022 *
## Airbags -4.489e+02 2.419e+01 -18.556 < 2e-16 ***
## New -4.407e+03 5.030e+02 -8.762 < 2e-16 ***
## Turbo 2.828e+03 3.220e+02 8.783 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9553 on 12031 degrees of freedom
## Multiple R-squared: 0.3033, Adjusted R-squared: 0.3022
## F-statistic: 291 on 18 and 12031 DF, p-value: < 2.2e-16
With an adjusted R-squared of only 0.30, this is not a reasonable model.
Linear Regression
We now fit a smaller model to see whether it performs better.
ctrl <- trainControl(method = "repeatedcv",
number = 5, repeats = 1)
ModelS = Price ~ Prod.year + Mileage + Cylinders + Engine.volume + Gear.box.type +
Fuel.type + Category + Manufacturer
Train
lm_tune <- train(ModelS, data = training,
method = "lm",
preProc=c('scale', 'center'),
trControl = ctrl)
lm_tune
## Linear Regression
##
## 12050 samples
## 8 predictor
##
## Pre-processing: scaled (8), centered (8)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 9639, 9639, 9640, 9641, 9641
## Resampling results:
##
## RMSE Rsquared MAE
## 10540.89 0.1506547 8431.032
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Predict
test_results <- data.frame(price = testing$Price)
test_results$lm <- predict(lm_tune, testing)
postResample(pred = test_results$lm, obs = test_results$price)
## RMSE Rsquared MAE
## 1.059690e+04 1.425268e-01 8.457320e+03
Visualization
qplot(test_results$lm, test_results$price) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
lims(x = c(0, 50500), y = c(0, 50500)) +
geom_abline(intercept = 0, slope = 1, colour = "cyan") +
  theme_bw()
After applying this linear regression model we reach much the same conclusion: the correlations are really low!
Overfitted Linear Regression
ModelFF = Price ~ Fuel.idx + Mileage + Prod.year + Levy + Engine.volume + Manufacturer+
Model + Category + Cylinders + Leather.interior + Fuel.type +
Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New +
Turbo
Train
alm_tune <- train(ModelFF, data = training,
method = "lm",
preProc=c('scale', 'center'),
trControl = ctrl)
Predict
test_results$alm <- predict(alm_tune, testing)
postResample(pred = test_results$alm, obs = test_results$price)
## RMSE Rsquared MAE
## 9764.7075204 0.2726752 7617.9727971
# even though it is still very low, we can see how the R-squared value has increased
Visualization
qplot(test_results$alm, test_results$price) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
lims(x = c(0, 50500), y = c(0, 50500)) +
geom_abline(intercept = 0, slope = 1, colour = "cyan") +
  theme_bw()
The overfitted linear regression model performs better than the previous one: the R-squared has increased by about 0.13 (though it is still very low). However, a model with this many predictors risks not generalizing well to new data: it fits the training data more closely, but can perform poorly on unseen observations.
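The train-well/test-poorly behaviour described above can be illustrated with a minimal base-R sketch on synthetic data (none of this uses the cars dataset; all names here are illustrative):

```r
# Overfitting demo: a modest vs. an over-parameterized polynomial fit.
set.seed(1)
n <- 30
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train_idx <- seq(1, n, by = 2)   # odd positions for training
test_idx  <- seq(2, n, by = 2)   # even positions for testing

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

fit3  <- lm(y ~ poly(x, 3),  subset = train_idx)   # modest model
fit12 <- lm(y ~ poly(x, 12), subset = train_idx)   # over-parameterized model

# training error always drops as terms are added...
rmse(y[train_idx], predict(fit3))
rmse(y[train_idx], predict(fit12))
# ...but the test error of the bigger model is typically worse
rmse(y[test_idx], predict(fit3,  data.frame(x = x[test_idx])))
rmse(y[test_idx], predict(fit12, data.frame(x = x[test_idx])))
```

Because the degree-12 model nests the degree-3 one, its training error can only decrease, while its test error reflects the noise it has memorized.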
Forward Regression
Train
for_tune <- train(ModelFF, data = training,
method = "leapForward",
preProc=c('scale', 'center'),
tuneGrid = expand.grid(nvmax = 4:10),
trControl = ctrl)
for_tune
## Linear Regression with Forward Selection
##
## 12050 samples
## 18 predictor
##
## Pre-processing: scaled (18), centered (18)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 9640, 9640, 9640, 9639, 9641
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 4 10019.601 0.2325435 7922.638
## 5 9820.318 0.2630073 7690.351
## 6 9723.115 0.2774605 7639.617
## 7 9708.525 0.2797322 7632.513
## 8 9675.124 0.2846655 7605.876
## 9 9649.654 0.2884678 7593.986
## 10 9622.573 0.2923922 7579.291
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 10.
plot(for_tune)
Predict
# variables that are selected
coef(for_tune$finalModel, for_tune$bestTune$nvmax)
## (Intercept) Fuel.idx Prod.year Levy Category
## 14586.0495 3550.2349 4875.3344 -1891.5734 -697.1272
## Fuel.type Gear.box.type Wheel Airbags New
## -2867.0605 2455.9784 -819.6320 -1573.2975 -750.6098
## Turbo
## 770.5597
test_results$frw <- predict(for_tune, testing)
postResample(pred = test_results$frw, obs = test_results$price)
## RMSE Rsquared MAE
## 9831.3195473 0.2627968 7669.0415601
Visualization
qplot(test_results$frw, test_results$price) +
labs(title="Forward Regression Observed VS Predicted", x="Predicted", y="Observed") +
lims(x = c(0, 50000), y = c(0, 50000)) +
geom_abline(intercept = 0, slope = 1, colour = "cyan") +
  theme_bw()
We can see that accuracy improves as the number of predictors grows, but the differences are small. These related models offer nothing new.
Backward Regression
Train
back_tune <- train(ModelFF, data = training,
method = "leapBackward",
preProc=c('scale', 'center'),
tuneGrid = expand.grid(nvmax = 4:10),
trControl = ctrl)
back_tune
## Linear Regression with Backwards Selection
##
## 12050 samples
## 18 predictor
##
## Pre-processing: scaled (18), centered (18)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 9641, 9639, 9639, 9640, 9641
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 4 10019.599 0.2329168 7922.251
## 5 9820.234 0.2629616 7689.203
## 6 9723.220 0.2773933 7638.404
## 7 9705.508 0.2800372 7631.584
## 8 9677.982 0.2840444 7611.935
## 9 9644.664 0.2889592 7584.385
## 10 9613.055 0.2935802 7562.309
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 10.
plot(back_tune)
Predict
# variables that are selected
coef(back_tune$finalModel, back_tune$bestTune$nvmax)
## (Intercept) Fuel.idx Prod.year Levy Engine.volume
## 14586.0495 3475.8132 5361.5362 -2194.3142 997.7935
## Category Fuel.type Gear.box.type Airbags New
## -732.2579 -2785.7817 2367.0933 -1799.4518 -783.1655
## Turbo
## 900.6762
test_results$bw <- predict(back_tune, testing)
postResample(pred = test_results$bw, obs = test_results$price)
## RMSE Rsquared MAE
## 9820.602262 0.264235 7656.466569
Visualization
qplot(test_results$bw, test_results$price) +
labs(title="Backward Regression Observed VS Predicted", x="Predicted", y="Observed") +
lims(x = c(0, 50000), y = c(0, 50000)) +
geom_abline(intercept = 0, slope = 1, colour = "cyan") +
  theme_bw()
We obtain very similar conclusions to those of forward regression.
Stepwise Regression
step_tune <- train(ModelFF, data = training,
method = "leapSeq",
preProc=c('scale', 'center'),
tuneGrid = expand.grid(nvmax = 4:10),
trControl = ctrl)
plot(step_tune)
# which variables are selected?
coef(step_tune$finalModel, step_tune$bestTune$nvmax)
## (Intercept) Fuel.idx Prod.year Levy Engine.volume
## 14586.0495 3475.8132 5361.5362 -2194.3142 997.7935
## Category Fuel.type Gear.box.type Airbags New
## -732.2579 -2785.7817 2367.0933 -1799.4518 -783.1655
## Turbo
## 900.6762
test_results$seq <- predict(step_tune, testing)
postResample(pred = test_results$seq, obs = test_results$price)
## RMSE Rsquared MAE
## 9820.602262 0.264235 7656.466569
These selection tools reach no new conclusions. The RMSE is minimized with 10 predictors, but on the testing set the RMSE and R-squared are still far from the values we are looking for.
Ridge Regression
# X matrix
X = model.matrix(ModelFF, data=training)[,-1] # skip column of ones
# y variable
y = training$Price
grid = seq(0, .1, length = 100) # a 100-value grid for lambda (rho in slides)
ridge.mod = glmnet(X, y, alpha=0, lambda=grid) # alpha=0 for ridge regression
dim(coef(ridge.mod))
## [1] 19 100
ridge.cv = cv.glmnet(X, y, type.measure="mse", alpha=0)
plot(ridge.cv)
opt.lambda <- ridge.cv$lambda.min
opt.lambda
## [1] 302.7303
lambda.index <- which(ridge.cv$lambda == ridge.cv$lambda.1se)
beta.ridge <- ridge.cv$glmnet.fit$beta[, lambda.index]
beta.ridge
## Fuel.idx Mileage Prod.year Levy
## 9.916733e+03 -5.152177e-07 9.580160e+02 -3.486091e+00
## Engine.volume Manufacturer Model Category
## 1.369622e+03 1.513909e+01 8.200371e-01 -2.580414e+02
## Cylinders Leather.interior Fuel.type Gear.box.type
## -4.257498e+02 -1.863170e+02 -1.538917e+03 2.136339e+03
## Drive.wheels Wheel Color Airbags
## 9.842520e+02 -2.932697e+03 3.132923e+01 -3.623346e+02
## New Turbo
## -4.158524e+03 3.042682e+03
X.test = model.matrix(ModelFF, data=testing)[,-1] # skip column of ones
ridge.pred = predict(ridge.cv$glmnet.fit, s=opt.lambda, newx=X.test)
y.test = log(testing$Price) # note: ridge.pred is on the raw price scale, so comparing it against log-prices inflates the errors below
postResample(pred = ridge.pred, obs = y.test)
## RMSE Rsquared MAE
## 15644.068649 0.135503 14487.778333
Ridge regression gives us another tool for improving the model by shrinking the coefficients. Note, however, that the metrics above compare raw-scale predictions against log-transformed prices, so the reported RMSE and MAE are inflated; on a consistent scale the conclusions are the same as before, and this will not be our final choice. We have used lambda.min, the value that minimizes the cross-validated error.
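The estimator glmnet approximates has a closed form that makes the role of lambda explicit. A minimal base-R sketch on standardized synthetic data (not the cars dataset; all names here are illustrative):

```r
# Ridge estimator: beta = (X'X + lambda * I)^{-1} X'y, on centered/scaled data
# (no intercept term needed when the data are centered).
set.seed(42)
n <- 200; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- X %*% c(3, -2, 1, 0, 0) + rnorm(n)

ridge_beta <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_small <- ridge_beta(X, y, lambda = 0.01)   # nearly the OLS solution
b_large <- ridge_beta(X, y, lambda = 1000)   # heavily shrunk
# a larger lambda shrinks the coefficients toward zero
sum(abs(b_small))
sum(abs(b_large))
```

Cross-validation, as in `cv.glmnet` above, is simply a data-driven way of picking where on this shrinkage path to stop.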
With Caret
# the grid for lambda
ridge_grid <- expand.grid(lambda = seq(0, .1, length = 100))
Train
ridge_tune <- train(ModelFF, data = training,
method='ridge',
preProc=c('scale','center'),
tuneGrid = ridge_grid,
trControl=ctrl)
plot(ridge_tune)
Best Tune
ridge_tune$bestTune
## lambda
## 4 0.003030303
Prediction
test_results$ridge <- predict(ridge_tune, testing)
postResample(pred = test_results$ridge, obs = test_results$price)
## RMSE Rsquared MAE
## 9764.3623637 0.2726683 7617.7512353
Similar results, with only a negligible change in the R-squared. Again, it is very low.
Lasso
lasso_grid <- expand.grid(fraction = seq(.01, 1, length = 100))
Train
lasso_tune <- train(ModelFF, data = training,
method='lasso',
preProc=c('scale','center'),
tuneGrid = lasso_grid,
trControl=ctrl)
plot(lasso_tune)
Best Tune
lasso_tune$bestTune
## fraction
## 100 1
Prediction
test_results$lasso <- predict(lasso_tune, testing)
postResample(pred = test_results$lasso, obs = test_results$price)
## RMSE Rsquared MAE
## 9764.7075204 0.2726752 7617.9727971
With the lasso the best tune keeps the full coefficient vector (fraction = 1), so the predictions essentially match the unpenalized model: the R-squared stays around 0.27, which is still too low for a useful model.
Elastic Net
modelLookup('glmnet')
## model parameter label forReg forClass probModel
## 1 glmnet alpha Mixing Percentage TRUE TRUE TRUE
## 2 glmnet lambda Regularization Parameter TRUE TRUE TRUE
elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))
Train
glmnet_tune <- train(ModelFF, data = training,
method='glmnet',
preProc=c('scale','center'),
tuneGrid = elastic_grid,
trControl=ctrl)
plot(glmnet_tune)
Best Tune
glmnet_tune$bestTune
## alpha lambda
## 231 0.2 0.1
Prediction
test_results$glmnet <- predict(glmnet_tune, testing)
postResample(pred = test_results$glmnet, obs = test_results$price)
## RMSE Rsquared MAE
## 9763.9481589 0.2726324 7616.7303021
As we can see, none of the regularization techniques above has substantially improved our linear approach. The evaluation metrics remain close to where they started, with the R-squared rising only from about 0.14 to 0.27. We will therefore try different families of models in search of the best one.
MACHINE LEARNING TOOLS
KNN
modelLookup('kknn')
## model parameter label forReg forClass probModel
## 1 kknn kmax Max. #Neighbors TRUE TRUE TRUE
## 2 kknn distance Distance TRUE TRUE TRUE
## 3 kknn kernel Kernel TRUE TRUE TRUE
# 3 hyper-parameters: kmax, distance, kernel
# kmax: number of neighbors considered
# distance: parameter of Minkowski distance (p in Lp)
# kernel: "rectangular" (standard unweighted knn), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian", "rank" and "optimal"
knn_tune <- train(ModelS,
data = training,
method = "kknn",
preProc=c('scale','center'),
tuneGrid = data.frame(kmax=c(2,5,8,12),distance=2,kernel='optimal'),
trControl = ctrl)
plot(knn_tune)
test_results$knn <- predict(knn_tune, testing)
postResample(pred = test_results$knn, obs = test_results$price)
## RMSE Rsquared MAE
## 7255.9567715 0.6017215 4627.6160497
Computationally, this has been a very expensive process, but a profitable one: the RMSE has dropped and the R-squared has risen to nearly 0.60. That value is still not good enough, although we should bear in mind that these metrics tend to be lower here than in classification, because the number of possible outcomes is much higher.
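The core of the method under the "rectangular" (unweighted) kernel is just an average of the k closest training responses. A minimal base-R sketch on tiny synthetic data (not the cars dataset; the function name and data are illustrative):

```r
# Unweighted k-nearest-neighbour regression: predict by averaging the
# responses of the k training points closest in Euclidean distance.
knn_predict <- function(X_train, y_train, x_new, k) {
  d <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))  # distances to x_new
  mean(y_train[order(d)[1:k]])                    # average k nearest responses
}

X_train <- matrix(c(0, 0,
                    1, 1,
                    10, 10), ncol = 2, byrow = TRUE)
y_train <- c(1, 3, 100)
knn_predict(X_train, y_train, x_new = c(0.4, 0.4), k = 2)  # -> 2 (averages the two nearby points)
```

The weighted kernels offered by kknn generalize this by down-weighting more distant neighbours instead of averaging them equally.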
Random Forest
rf_tune <- train(ModelS,
data = training,
method = "rf",
preProc=c('scale','center'),
trControl = ctrl,
ntree = 100,
tuneGrid = data.frame(mtry=c(1,3,5,7)),
importance = TRUE)
plot(rf_tune)
test_results$rf <- predict(rf_tune, testing)
postResample(pred = test_results$rf, obs = test_results$price)
## RMSE Rsquared MAE
## 6648.5219271 0.6641209 4437.5440699
# variable importance
plot(varImp(rf_tune, scale = F), scales = list(y = list(cex = .95)))
Again, this has been a really expensive process, but also a more effective one. Random forest ranks the most correlated variable, Prod.year, as the most important, while still giving weight to the others. The RMSE is smaller and the R-squared has increased significantly for our dataset: we now reach almost 0.67, having started at 0.14-0.30!
It improves on both the linear approaches and KNN; so far, this is the best model.
Gradient Boosting
xgb_tune <- train(ModelS,
data = training,
method = "xgbTree",
preProc=c('scale','center'),
objective="reg:squarederror",
trControl = ctrl,
tuneGrid = expand.grid(nrounds = c(10,50), max_depth = c(5,6,7), eta = c(0.01, 0.1, 1),
gamma = c(1, 2, 3), colsample_bytree = c(1, 2),
min_child_weight = c(1), subsample = c(0.2,0.5,0.8)))
test_results$xgb <- predict(xgb_tune, testing)
postResample(pred = test_results$xgb, obs = test_results$price)
## RMSE Rsquared MAE
## 7297.7494915 0.5964994 5257.4270377
As before, this process has been computationally very expensive, and it has proved difficult to beat random forest. This will not be our model: although it clearly improves on the linear approaches, it still falls short of the other machine learning tools.
Ensemble
apply(test_results[-1], 2, function(x) mean(abs(x - test_results$price)))
## lm alm frw bw seq ridge lasso glmnet
## 8457.320 7617.973 7669.042 7656.467 7656.467 7617.751 7617.973 7616.730
## knn rf xgb
## 4627.616 4437.544 5257.427
# Combination
test_results$comb = (test_results$alm + test_results$knn + test_results$rf)/3
postResample(pred = test_results$comb, obs = test_results$price)
## RMSE Rsquared MAE
## 7116.5643675 0.6325901 5122.6511235
Given how weakly correlated our data are, this is not a bad model, but in this particular case it does not improve on random forest (the best model for our dataset), although it comes close. The only two candidates remain random forest and KNN, whose performance is quite similar.
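The intuition behind averaging models is that independent errors partially cancel: for m predictors with independent errors of equal variance, the mean prediction has 1/m of that variance. A self-contained synthetic sketch (all data here are illustrative):

```r
# Why ensembling helps: averaging three predictors with independent
# unit-variance errors cuts the error variance to roughly one third.
set.seed(3)
truth <- rep(0, 5000)
p1 <- truth + rnorm(5000)
p2 <- truth + rnorm(5000)
p3 <- truth + rnorm(5000)
comb <- (p1 + p2 + p3) / 3

var(p1 - truth)    # about 1
var(comb - truth)  # about 1/3
```

In our case the three models' errors are correlated (they are fit on the same data), so the gain is smaller than this idealized 1/3.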
Final predictions
yhat = test_results$comb
head(yhat) # show the prediction for 6 car prices
## [1] 10630.1362 3897.2514 12599.4401 40.4918 11523.9037 13522.6770
par(mfrow=c(1,1))
hist(yhat, col="lightblue")
Prediction intervals
The errors are more symmetric.
y = test_results$price
error = y-yhat
hist(error, col="mediumpurple3")
Machine learning tools do not provide prediction intervals directly, but we can split the testing set in two: one part for measuring the size of the noise, and another for computing intervals from that size.
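The splitting idea can be sketched end-to-end on synthetic data, where the true prices are known (nothing here uses the cars dataset; all names are illustrative):

```r
# Split-based 90% prediction intervals: estimate error quantiles on a
# held-out "noise" half, then widen predictions on the other half.
set.seed(7)
y_true <- rnorm(2000, mean = 50)
y_hat  <- y_true + rnorm(2000, sd = 2)   # imperfect predictions

noise <- (y_true - y_hat)[1:1000]        # errors on the noise split
lwr <- y_hat[1001:2000] + quantile(noise, 0.05)
upr <- y_hat[1001:2000] + quantile(noise, 0.95)

# empirical coverage on the disjoint evaluation split
coverage <- mean(y_true[1001:2000] >= lwr & y_true[1001:2000] <= upr)
coverage  # close to the nominal 0.90
```

Keeping the noise and evaluation splits disjoint, as in this sketch, is what makes the coverage estimate honest.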
Let’s consider the first 1000 cars in testing to compute the noise size.
For the prediction intervals we will fix a 90% confidence.
noise = error[1:1000]
lwr = yhat[101:length(yhat)] + quantile(noise,0.05, na.rm=T)
upr = yhat[101:length(yhat)] + quantile(noise,0.95, na.rm=T)
predictions = data.frame(real=y[101:length(y)], fit=yhat[101:length(yhat)], lwr=lwr, upr=upr)
predictions = predictions %>% mutate(out=factor(if_else(real<lwr | real>upr,1,0)))
# how many real observations are out of the intervals?
mean(predictions$out==1)
## [1] 0.09913132
For the coverage check we use the predictions from position 101 onwards; note that these partially overlap the noise sample (the first 1000 cars), so starting at 1001 would have kept the two sets disjoint.
In the plot we can see the observations that are out of the intervals.
ggplot(predictions, aes(x=fit, y=real))+
geom_point(aes(color=out)) + theme(legend.position="none") +
xlim(0, 20000) + ylim(0, 20000)+
geom_ribbon(data=predictions,aes(ymin=lwr,ymax=upr),alpha=0.3) +
  labs(title = "Prediction intervals", x = "prediction", y = "real price")
Conclusion
During this analysis, we began by exploring the cars dataset in search of relations and trends across the years, and later focused on 21st-century cars for a better comparison and understanding of car prices and pollution.
We have seen relations involving the colors of the cars and the different categories, models and manufacturers, as well as how fuel type affects pollution, the relation between cylinders and levy, and the influence of engine volume. However, we have mainly focused on classification and prediction.
After evaluating different approaches, we built a model that classifies a car's contribution to pollution ("HIGH", "MEDIUM" or "LOW") with very good accuracy thanks to random forest. Decision trees, despite depending on a subset of the features, remain helpful for interpreting a model in an easier and faster way. The high accuracy is partly explained by the fact that pollution is a variable we constructed at the beginning from other variables, taking CO2 emissions into account.
For regression, we have seen that the variables in general did not explain much of the variation in price. Across the different models, the accuracy and R-squared values were poor, but it is remarkable how much random forest improved them. Although we never reached a great R-squared, we increased it substantially, which shows how well random forest performs and how much more accurate its models are.
After evaluating models for both classification and regression, random forest emerges as the best model, by far the strongest on every evaluation criterion. This makes sense given that our data are nonlinear and weakly correlated: random forest is one of the best tools for this purpose, since it is precise and generalizes well over the data, achieving reasonable accuracy even with many levels to choose from.